
2025-05-21-12-08

Language and Thought: The View from LLMs

Abstract

arXiv:2505.13561v1 Announce Type: new Abstract: Daniel Dennett speculated in Kinds of Minds (1996): "Perhaps the kind of mind you get when you add language to it is so different from the kind of mind you can have without language that calling them both minds is a mistake." Recent work in AI can be seen as testing Dennett's thesis by exploring the performance of AI systems with and without linguistic training. I argue that the success of Large Language Models at inferential reasoning, limited though it may be, supports Dennett's radical view about the effect of language on thought. I suggest it is the abstractness and efficiency of linguistic encoding that lies behind the capacity of LLMs to perform inferences across a wide range of domains. In a slogan, language makes inference computationally tractable. I assess what these results in AI indicate about the role of language in the workings of our own biological minds.

摘要

丹尼尔·丹尼特在1996年出版的《心灵种种》中提出猜想:"或许,添加语言能力的心灵与不具备语言能力的心灵差异如此之大,将两者统称为'心灵'可能是一种错误。"近期人工智能领域的研究可视为对丹尼特命题的检验,通过比较接受语言训练与未接受语言训练的AI系统表现来验证这一观点。本文认为,尽管存在局限性,但大型语言模型在推理任务上取得的成功支持了丹尼特关于语言对思维影响的激进主张。笔者认为,正是语言编码的抽象性和高效性,使得大语言模型能够跨领域进行推理。简言之,语言使推理在计算层面变得可行。最后,本文评估了这些AI研究成果对我们理解生物大脑中语言作用机制的启示。


BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs

Abstract

arXiv:2505.13529v1 Announce Type: new Abstract: Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to the overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL, a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL-training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while still achieving accuracy comparable to models finetuned on reasoning data generated by R1. These results suggest that our pilot study can inspire the development of more reliable and factual System 2 LRMs.

摘要

大型推理模型(LRMs)的最新进展在数学和逻辑推理方面展现出令人印象深刻的能力。然而,当前LRMs极少承认无知或回应"我不知道",反而经常在表现出不当自信的同时产生错误答案,这引发了对其事实可靠性的担忧。在本研究中,我们识别出两种由过度思考导致的病态推理模式——最后一刻猜测和二次思考螺旋,这些模式导致了过度自信的错误答案。为解决这些问题,我们提出BARREL这一新颖框架,旨在促进简洁且边界感知的事实推理。实验表明,BARREL训练将DeepSeek-R1-Distill-Llama-8B的可靠性从39.33%提升至61.48%,同时仍保持与基于R1生成推理数据微调的模型相当的准确度。这些结果证明,我们的初步研究对构建更可靠、更注重事实的系统2型LRMs具有启发意义。


Evaluating Large Language Models for Real-World Engineering Tasks

Abstract

arXiv:2505.13484v1 Announce Type: new Abstract: Large Language Models (LLMs) are transformative not only for daily activities but also for engineering tasks. However, current evaluations of LLMs in engineering exhibit two critical shortcomings: (i) the reliance on simplified use cases, often adapted from examination materials where correctness is easily verifiable, and (ii) the use of ad hoc scenarios that insufficiently capture critical engineering competencies. Consequently, the assessment of LLMs on complex, real-world engineering problems remains largely unexplored. This paper addresses this gap by introducing a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios, systematically designed to cover core competencies such as product design, prognosis, and diagnosis. Using this dataset, we evaluate four state-of-the-art LLMs, including both cloud-based and locally hosted instances, to systematically investigate their performance on complex engineering tasks. Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.

摘要

大语言模型(LLMs)不仅对日常活动具有变革性影响,对工程任务同样如此。然而,当前针对LLMs的工程能力评估存在两个关键缺陷:(1)依赖简化的用例,这些用例通常改编自易于验证正确性的考试材料;(2)采用临时性场景,未能充分捕捉关键工程能力。因此,LLMs在复杂现实工程问题上的表现仍属未知领域。本文通过构建一个精选数据库填补这一空白,该数据库包含100多个源自真实生产导向工程场景的问题,系统性地涵盖产品设计、预测与诊断等核心能力。基于该数据集,我们评估了四种最先进的LLMs(包括云端和本地部署实例),以系统研究其在复杂工程任务中的表现。结果表明,LLMs在基础时间推理和结构推理方面表现突出,但在抽象推理、形式化建模以及上下文敏感的工程逻辑方面存在显著不足。


Contrastive Cross-Course Knowledge Tracing via Concept Graph Guided Knowledge Transfer

Abstract

arXiv:2505.13489v1 Announce Type: new Abstract: Knowledge tracing (KT) aims to predict learners' future performance based on historical learning interactions. However, existing KT models predominantly focus on data from a single course, limiting their ability to capture a comprehensive understanding of learners' knowledge states. In this paper, we propose TransKT, a contrastive cross-course knowledge tracing method that leverages concept graph guided knowledge transfer to model the relationships between learning behaviors across different courses, thereby enhancing knowledge state estimation. Specifically, TransKT constructs a cross-course concept graph by leveraging zero-shot Large Language Model (LLM) prompts to establish implicit links between related concepts across different courses. This graph serves as the foundation for knowledge transfer, enabling the model to integrate and enhance the semantic features of learners' interactions across courses. Furthermore, TransKT includes an LLM-to-LM pipeline for incorporating summarized semantic features, which significantly improves the performance of Graph Convolutional Networks (GCNs) used for knowledge transfer. Additionally, TransKT employs a contrastive objective that aligns single-course and cross-course knowledge states, thereby refining the model's ability to provide a more robust and accurate representation of learners' overall knowledge states.

摘要

知识追踪(KT)旨在基于历史学习交互预测学习者的未来表现。然而,现有KT模型主要关注单一课程数据,限制了其全面理解学习者知识状态的能力。本文提出TransKT——一种对比式跨课程知识追踪方法,通过概念图引导的知识迁移来建模不同课程间学习行为的关联,从而提升知识状态估计效果。具体而言,TransKT利用零样本大语言模型(LLM)提示构建跨课程概念图,建立不同课程相关概念间的隐含联系。该图作为知识迁移的基础,使模型能够整合并增强跨课程学习交互的语义特征。此外,TransKT采用LLM-to-LM管道融入语义特征摘要,显著提升了用于知识迁移的图卷积网络(GCN)性能。同时,该方法通过对比目标函数对齐单课程与跨课程知识状态,从而优化模型对学习者整体知识状态的表征能力,使其更具鲁棒性和准确性。


AgentSGEN: Multi-Agent LLM in the Loop for Semantic Collaboration and GENeration of Synthetic Data

Abstract

arXiv:2505.13466v1 Announce Type: new Abstract: The scarcity of data depicting dangerous situations presents a major obstacle to training AI systems for safety-critical applications, such as construction safety, where ethical and logistical barriers hinder real-world data collection. This creates an urgent need for an end-to-end framework to generate synthetic data that can bridge this gap. While existing methods can produce synthetic scenes, they often lack the semantic depth required for scene simulations, limiting their effectiveness. To address this, we propose a novel multi-agent framework that employs an iterative, in-the-loop collaboration between two agents: an Evaluator Agent, acting as an LLM-based judge to enforce semantic consistency and safety-specific constraints, and an Editor Agent, which generates and refines scenes based on this guidance. Powered by the LLM's reasoning capabilities and common-sense knowledge, this collaborative design produces synthetic images tailored to safety-critical scenarios. Our experiments suggest this design can generate useful scenes based on realistic specifications that address the shortcomings of prior approaches, balancing safety requirements with visual semantics. This iterative process holds promise for delivering robust, aesthetically sound simulations, offering a potential solution to the data scarcity challenge in multimedia safety applications.

摘要

描述危险场景的数据稀缺性对训练面向安全关键应用(如建筑施工安全)的AI系统构成了主要障碍,其中伦理和后勤方面的障碍阻碍了真实世界数据收集。这迫切需要一种端到端框架来生成能够弥补这一差距的合成数据。虽然现有方法可以生成合成场景,但它们往往缺乏场景模拟所需的语义深度,限制了其有效性。为解决这一问题,我们提出了一种新颖的多智能体框架,采用两个智能体之间的迭代式闭环协作:评估智能体作为基于大型语言模型(LLM)的裁判,强制执行语义一致性和安全特定约束;编辑智能体则根据这一指导生成并优化场景。借助LLM的推理能力和常识知识,这种协作设计能够生成针对安全关键场景定制的合成图像。实验表明,该框架可根据现实规范生成有效场景,解决现有方法的不足,在安全要求与视觉语义之间实现平衡。这种迭代过程有望提供鲁棒且美学合理的模拟,为多媒体安全应用中的数据稀缺挑战提供潜在解决方案。


Prompt Stability Matters: Evaluating and Optimizing Auto-Generated Prompt in General-Purpose Systems

Abstract

arXiv:2505.13546v1 Announce Type: new Abstract: Automatic prompt generation plays a crucial role in enabling general-purpose multi-agent systems to perform diverse tasks autonomously. Existing methods typically evaluate prompts based on their immediate task performance, overlooking the intrinsic qualities that determine their reliability. This outcome-centric view not only limits interpretability but also fails to account for the inherent stochasticity of large language models (LLMs). In this work, we bring attention to prompt stability, the consistency of model responses across repeated executions, as a key factor for building robust and effective prompt generation systems. To quantify this, we propose semantic stability as a criterion for assessing the response consistency of prompts, and fine-tune a LLaMA-based evaluator to measure it automatically across tasks. These components have enabled us to develop the first stability-aware general-purpose prompt generation system that leverages stability feedback to iteratively enhance both prompt quality and system-level performance. Furthermore, we establish a logical chain between prompt stability and task success by analyzing the structural dependencies within our system, proving stability as a necessary condition for effective system-level execution. Empirical results across general and domain-specific tasks demonstrate that our stability-aware framework improves both accuracy and output consistency. By shifting the focus from one-off results to persistent reliability, our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general-purpose systems.

摘要

自动提示生成在实现通用多智能体系统自主执行多样化任务中起着关键作用。现有方法通常基于即时任务表现评估提示,忽视了决定其可靠性的内在特质。这种以结果为中心的视角不仅限制了可解释性,也无法充分考虑大语言模型(LLMs)固有的随机性。本研究提出提示稳定性——模型在重复执行中响应的一致性——作为构建鲁棒高效提示生成系统的关键因素。为量化该特性,我们提出语义稳定性作为评估提示响应一致性的标准,并微调基于LLaMA的评估器以实现跨任务的自动测量。这些组件使我们开发出首个具备稳定性感知的通用提示生成系统,该系统利用稳定性反馈迭代提升提示质量和系统级性能。进一步地,通过分析系统内部结构依赖性,我们建立了提示稳定性与任务成功之间的逻辑链条,证明稳定性是实现有效系统级执行的必要条件。在通用和领域特定任务中的实证结果表明,我们的稳定性感知框架同时提高了准确性和输出一致性。通过将关注点从一次性结果转向持久可靠性,本研究为提示设计提供了新视角,并为构建更可信的通用系统贡献了实用工具。
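The notion of prompt stability described above (consistency of responses across repeated executions) can be illustrated with a minimal sketch. This is not the paper's method: the abstract says stability is measured by a fine-tuned LLaMA-based evaluator, whereas the hypothetical `jaccard` function below is a crude token-overlap stand-in, and `stability` is an illustrative name for the mean pairwise similarity over repeated runs of the same prompt.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a crude stand-in for a semantic evaluator."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def stability(responses, similarity=jaccard) -> float:
    """Mean pairwise similarity of responses collected from repeated runs of one prompt."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to estimate stability")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```

Passing a different `similarity` callable (for example, one backed by an embedding model) changes the notion of "semantic" agreement without changing the aggregation.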


Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers

Abstract

arXiv:2505.13737v1 Announce Type: new Abstract: We present causal head gating (CHG), a scalable method for interpreting the functional roles of attention heads in transformer models. CHG learns soft gates over heads and assigns them a causal taxonomy - facilitating, interfering, or irrelevant - based on their impact on task performance. Unlike prior approaches in mechanistic interpretability, which are hypothesis-driven and require prompt templates or target labels, CHG applies directly to any dataset using standard next-token prediction. We evaluate CHG across multiple large language models (LLMs) in the Llama 3 model family and diverse tasks, including syntax, commonsense, and mathematical reasoning, and show that CHG scores yield causal - not merely correlational - insight, validated via ablation and causal mediation analyses. We also introduce contrastive CHG, a variant that isolates sub-circuits for specific task components. Our findings reveal that LLMs contain multiple sparse, sufficient sub-circuits, that individual head roles depend on interactions with others (low modularity), and that instruction following and in-context learning rely on separable mechanisms.

摘要

我们提出因果头门控(CHG),一种可扩展的方法,用于解释Transformer模型中注意力头的功能角色。CHG为各注意力头学习软门控,并根据其对任务性能的影响将其归入促进型、干扰型或无关型这一因果分类。与机制可解释性研究中假设驱动、需要提示模板或目标标签的已有方法不同,CHG可通过标准的下一词预测直接应用于任何数据集。我们在Llama 3系列多个大语言模型(LLM)上评估CHG,涵盖句法、常识和数学推理等多样化任务,结果表明CHG分数能提供因果性(而非仅相关性)洞见,这一结论通过消融实验和因果中介分析得到验证。我们还提出对比式CHG变体,可分离特定任务组件的子回路。研究发现表明:大语言模型包含多个稀疏且充分的子回路;单个注意力头的角色取决于与其他注意力头的交互(低模块性);指令跟随和上下文学习依赖于可分离的机制。


FinMaster: A Holistic Benchmark for Mastering Full-Pipeline Financial Workflows with LLMs

Abstract

arXiv:2505.13533v1 Announce Type: new Abstract: Financial tasks are pivotal to global economic stability; however, their execution faces challenges including labor-intensive processes, low error tolerance, data fragmentation, and tool limitations. Although large language models (LLMs) have succeeded in various natural language processing tasks and have shown potential in automating workflows through reasoning and contextual understanding, current benchmarks for evaluating LLMs in finance lack sufficient domain-specific data, rely on simplistic task designs, and use incomplete evaluation frameworks. To address these gaps, this article presents FinMaster, a comprehensive financial benchmark designed to systematically assess the capabilities of LLMs in financial literacy, accounting, auditing, and consulting. Specifically, FinMaster comprises three main modules: i) FinSim, which builds simulators that generate synthetic, privacy-compliant financial data for companies to replicate market dynamics; ii) FinSuite, which provides tasks in core financial domains, spanning 183 tasks of various types and difficulty levels; and iii) FinEval, which develops a unified interface for evaluation. Extensive experiments over state-of-the-art LLMs reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 40% on complex scenarios requiring multi-step reasoning. This degradation reflects the propagation of computational errors: single-metric calculations that initially reach 58% accuracy fall to 37% in multi-metric scenarios. To the best of our knowledge, FinMaster is the first benchmark that covers full-pipeline financial workflows with challenging tasks. We hope that FinMaster can bridge the gap between research and industry practitioners, driving the adoption of LLMs in real-world financial practices to enhance efficiency and accuracy.

摘要

金融任务对全球经济稳定至关重要,但其执行过程面临劳动力密集、容错率低、数据碎片化及工具局限等挑战。尽管大语言模型(LLMs)在各类自然语言处理任务中表现优异,并通过推理与上下文理解展现出工作流自动化潜力,当前金融领域LLM评估基准仍存在领域数据不足、任务设计过于简单、评估框架不完善等问题。为填补这些空白,本文提出FinMaster——一个系统评估LLM在金融素养、会计、审计及咨询领域能力的综合性金融基准。具体而言,FinMaster包含三大核心模块:i) FinSim通过构建仿真器生成符合隐私要求的合成企业财务数据以模拟市场动态;ii) FinSuite提供涵盖183项多类型、多难度任务的核心金融领域测试集;iii) FinEval开发标准化评估接口。基于前沿LLM的大规模实验揭示了金融推理中的关键能力缺陷:基础任务准确率超过90%,而需要多步推理的复杂场景骤降至40%。这种性能退化表现为计算误差传导现象——单指标计算初始准确率为58%,在多指标场景中降至37%。据我们所知,FinMaster是首个覆盖全流程金融工作流且包含高难度任务的基准测试。我们期望FinMaster能弥合学术界与业界的鸿沟,推动LLM在真实金融实践中的应用,从而提升效率与准确性。


A*-Decoding: Token-Efficient Inference Scaling

Abstract

arXiv:2505.13672v1 Announce Type: new Abstract: Inference-time scaling has emerged as a powerful alternative to parameter scaling for improving language model performance on complex reasoning tasks. While existing methods have shown strong performance gains under fixed compute budgets, there has been little focus on optimally utilizing that budget during inference. In this work, we introduce A*-decoding, a search-based inference-time strategy that builds on the A* search algorithm to optimally utilize a fixed compute budget by prioritizing high-quality reasoning paths during generation. We frame language model decoding as a structured search in a state space of partial solutions, applying the A* transition model to identify promising continuations guided by an external process supervision signal. In our experiments, A*-decoding reaches the performance levels of strong inference scaling baselines like best-of-N and particle filtering while using up to 3x fewer tokens and 30% fewer PRM passes under equivalent compute budgets. On the MATH500 and AIME 2024 benchmarks, A*-decoding enables Llama-3.2-1B-Instruct to match the performance of the 70x larger Llama-3.1-70B-Instruct, and allows Qwen3-1.7B to reach o1-like reasoning accuracy. These results highlight the power of structured search in decoding, offering an alternative to brute-force sampling or scale-driven gains. Our work demonstrates how thoughtful inference-time strategies can enhance reasoning in SLMs, pointing toward future advances in more efficient and scalable language model deployment.

摘要

推理时缩放已成为参数缩放的有力替代方案,用于提升语言模型在复杂推理任务上的性能。尽管现有方法在固定计算预算下展现出显著的性能增益,但如何在推理过程中最优地利用该预算却鲜有研究。本文提出A*解码,这是一种基于搜索的推理时策略,它以A*搜索算法为基础,在生成过程中优先选择高质量推理路径,从而最优地利用固定计算预算。我们将语言模型解码表述为部分解状态空间中的结构化搜索,应用A*转移模型,在外部过程监督信号引导下识别有潜力的续写路径。实验表明,在同等计算预算下,A*解码能达到best-of-N和粒子滤波等强推理缩放基线的性能水平,同时token用量最多降至约三分之一,PRM调用次数减少30%。在MATH500和AIME 2024基准测试中,A*解码使Llama-3.2-1B-Instruct达到规模为其70倍的Llama-3.1-70B-Instruct的性能,并让Qwen3-1.7B达到接近o1的推理准确率。这些结果凸显了结构化搜索在解码中的强大潜力,为暴力采样或规模驱动增益提供了替代方案。我们的工作表明,精心设计的推理时策略能够增强小语言模型(SLM)的推理能力,为未来实现更高效、可扩展的语言模型部署指明了方向。
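The idea of treating decoding as A* search over partial solutions can be sketched on a toy problem. Everything here is an assumption made for illustration: `TOY_PRM` is a hypothetical lookup table standing in for a process reward model, the two-symbol step vocabulary replaces sampled reasoning steps, and the cost and heuristic definitions are generic textbook A* choices, not the paper's transition model.

```python
import heapq
import itertools
import math

# Hypothetical stand-in for a process reward model (PRM): a score in (0, 1]
# for appending `step` after `prev` (prev is None at the start).
TOY_PRM = {
    (None, "a"): 0.9, (None, "b"): 0.6,
    ("a", "a"): 0.2, ("a", "b"): 0.7,
    ("b", "a"): 0.8, ("b", "b"): 0.5,
}
STEPS = ("a", "b")
# Cheapest possible cost of any single step; used for an admissible heuristic.
MIN_STEP_COST = -math.log(max(TOY_PRM.values()))


def step_cost(prev, step):
    """Cost of one step: negative log of its toy PRM score."""
    return -math.log(TOY_PRM[(prev, step)])


def a_star_decode(length):
    """Return (path, product of step scores, number of expansions) for the best path."""
    tie = itertools.count()  # tie-breaker so the heap never compares paths
    frontier = [(length * MIN_STEP_COST, next(tie), 0.0, ())]
    expanded = 0
    while frontier:
        _, _, g, path = heapq.heappop(frontier)
        if len(path) == length:
            return path, math.exp(-g), expanded
        expanded += 1
        prev = path[-1] if path else None
        for step in STEPS:
            g_next = g + step_cost(prev, step)
            # Remaining steps each cost at least MIN_STEP_COST, so h never overestimates.
            h_next = (length - len(path) - 1) * MIN_STEP_COST
            heapq.heappush(frontier, (g_next + h_next, next(tie), g_next, path + (step,)))
    return None
```

Because the heuristic never overestimates the remaining cost, the first complete path popped from the frontier has the highest product of step scores, and low-scoring partial paths are left unexpanded rather than sampled to completion.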


Q²Forge: Minting Competency Questions and SPARQL Queries for Question-Answering Over Knowledge Graphs

Abstract

arXiv:2505.13572v1 Announce Type: new Abstract: The SPARQL query language is the standard method to access knowledge graphs (KGs). However, formulating SPARQL queries is a significant challenge for non-expert users, and remains time-consuming for experienced ones. Best practices recommend documenting KGs with competency questions and example queries to contextualise the knowledge they contain and illustrate their potential applications. In practice, however, this is either not the case or the examples are provided in limited numbers. Large Language Models (LLMs) are being used in conversational agents and are proving to be an attractive solution with a wide range of applications, from simple question-answering about common knowledge to generating code in a targeted programming language. However, training and testing these models to produce high-quality SPARQL queries from natural language questions requires substantial datasets of question-query pairs. In this paper, we present Q²Forge, which addresses the challenge of generating new competency questions for a KG and corresponding SPARQL queries. It iteratively validates those queries with human feedback and an LLM as a judge. Q²Forge is open source, generic, extensible and modular, meaning that the different modules of the application (CQ generation, query generation and query refinement) can be used separately, as an integrated pipeline, or replaced by alternative services. The result is a complete pipeline from competency question formulation to query evaluation, supporting the creation of reference query sets for any target KG.

摘要

SPARQL查询语言是访问知识图谱(KGs)的标准方法。然而,对于非专业用户而言,编写SPARQL查询是一项重大挑战,即使对于有经验的用户也依然耗时。最佳实践建议通过能力问题和示例查询来记录知识图谱,以情境化其包含的知识并展示其潜在应用。但在实践中,这种做法往往缺失或仅提供有限数量的示例。大型语言模型(LLMs)正被用于对话代理,并证明是一种具有广泛应用的解决方案,从关于常识的简单问答到生成目标编程语言的代码。然而,训练和测试这些模型以从自然语言问题生成高质量的SPARQL查询,需要大量的问题-查询对数据集。本文提出Q²Forge,旨在解决为知识图谱生成新能力问题及相应SPARQL查询的挑战。该方法通过人类反馈和LLM作为评判者迭代验证这些查询。Q²Forge是开源、通用、可扩展和模块化的,意味着应用程序的不同模块(能力问题生成、查询生成和查询优化)可以单独使用、作为集成管道或替换为替代服务。其结果是一个从能力问题表述到查询评估的完整流程,支持为任何目标知识图谱创建参考查询集。


HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems

Abstract

arXiv:2505.13516v1 Announce Type: new Abstract: Recent advancements in Multi-Agent Systems (MAS) powered by Large Language Models (LLMs) have demonstrated tremendous potential in diverse task scenarios. Nonetheless, existing agentic systems typically rely on predefined agent-role design spaces and static communication structures, limiting their adaptability as well as flexibility in complex interaction environments and leading to subpar performance on highly specialized and expert-level tasks. To address these issues, we introduce HALO, a multi-agent collaboration framework based on a hierarchical reasoning architecture. Specifically, we incorporate a high-level planning agent for task decomposition, mid-level role-design agents for subtask-specific agent instantiation, and low-level inference agents for subtask execution. Particularly, subtask execution is reformulated as a structured workflow search problem, where Monte Carlo Tree Search (MCTS) systematically explores the agentic action space to construct optimal reasoning trajectories. Additionally, as the majority of users lack expertise in prompt engineering, we leverage an Adaptive Prompt Refinement module to transform raw queries into task-specific prompts. Empirical evaluations on Code Generation (HumanEval), General Reasoning (MMLU), and Arithmetic Reasoning (MATH) benchmark datasets highlight the effectiveness of HALO, yielding a 14.4% average improvement over state-of-the-art baselines. Notably, HALO achieves up to 13.3% performance gain on the Moral Scenarios subject in the MMLU benchmark and up to 19.6% performance gain on the Algebra subarea in the MATH benchmark, indicating its advanced proficiency in tackling highly specialized and expert-level tasks. The code repository is available at https://github.com/23japhone/HALO.

摘要

近年来,基于大语言模型(LLMs)的多智能体系统(MAS)在多样化任务场景中展现出巨大潜力。然而,现有智能体系统通常依赖于预定义的智能体角色设计空间和静态通信结构,这限制了其在复杂交互环境中的适应性与灵活性,导致在高度专业化和专家级任务上表现欠佳。为解决这些问题,我们提出了HALO——一种基于分层推理架构的多智能体协作框架。具体而言,该框架包含高层规划智能体(负责任务分解)、中层角色设计智能体(负责子任务导向的智能体实例化)以及底层推理智能体(负责子任务执行)。特别地,我们将子任务执行重构为结构化工作流搜索问题,通过蒙特卡洛树搜索(MCTS)系统性地探索智能体动作空间以构建最优推理轨迹。此外,针对大多数用户缺乏提示词工程专业知识的现状,我们采用自适应提示词优化模块将原始查询转化为任务专属提示词。在代码生成(HumanEval)、通用推理(MMLU)和数学推理(MATH)基准数据集上的实证评估表明,HALO相较于最先进基线模型平均提升14.4%的性能。值得注意的是,HALO在MMLU基准的道德情景科目中最高获得13.3%的性能提升,在MATH基准的代数子领域最高获得19.6%的性能提升,这印证了其在处理高度专业化和专家级任务方面的卓越能力。代码仓库详见https://github.com/23japhone/HALO。


Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

Abstract

arXiv:2505.13774v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multi-path Chain-of-Thought explorations before producing final answers. Ensuring the faithfulness of these intermediate reasoning processes is crucial for reliable monitoring, interpretation, and effective control. In this paper, we propose a systematic counterfactual intervention framework to rigorously evaluate thinking draft faithfulness. Our approach focuses on two complementary dimensions: (1) Intra-Draft Faithfulness, which assesses whether individual reasoning steps causally influence subsequent steps and the final draft conclusion through counterfactual step insertions; and (2) Draft-to-Answer Faithfulness, which evaluates whether final answers are logically consistent with and dependent on the thinking draft, by perturbing the draft's concluding logic. We conduct extensive experiments across six state-of-the-art LRMs. Our findings show that current LRMs demonstrate selective faithfulness to intermediate reasoning steps and frequently fail to faithfully align with the draft conclusions. These results underscore the need for more faithful and interpretable reasoning in advanced LRMs.

摘要

大型推理模型(LRMs)通过引入思维草稿机制显著提升了复杂问题解决能力,该机制可在生成最终答案前进行多路径思维链探索。确保这些中间推理过程的忠实性对于可靠监控、解释和有效控制至关重要。本文提出一个系统的反事实干预框架来严格评估思维草稿的忠实性。我们的方法聚焦两个互补维度:(1)草稿内忠实性——通过反事实步骤插入,评估单个推理步骤是否因果性影响后续步骤及草稿最终结论;(2)草稿到答案的忠实性——通过扰动草稿的结论逻辑,评估最终答案是否与思维草稿保持逻辑一致性和依赖性。我们在六种最先进的LRM上进行了广泛实验。研究发现当前LRMs对中间推理步骤仅表现出选择性忠实,且经常无法与草稿结论保持可靠一致。这些结果凸显了先进LRMs需要更具忠实性和可解释性的推理机制。


Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

Abstract

arXiv:2505.13718v1 Announce Type: new Abstract: Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely, Knights & Knaves (K&K) logic puzzles to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval+, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR trained on the same small dataset (≤100 examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.

摘要

设计具备有效推理能力的大型语言模型(LLM)通常需要采用可验证奖励的强化学习(RLVR)或通过精心构建的长链思维(CoT)进行知识蒸馏,这两种方法都高度依赖大量训练数据。当优质训练数据稀缺时,这便形成了重大挑战。我们提出了一种样本高效的二阶段训练策略,用于在有限监督下开发具备推理能力的LLM。第一阶段,我们通过从玩具领域(即骑士与无赖(K&K)逻辑谜题)中蒸馏长链思维来"预热"模型,使其获得通用推理能力;第二阶段,我们使用少量目标域样本对预热后的模型进行RLVR训练。实验表明该双阶段方法具有以下优势:(i)仅预热阶段即可促进泛化推理能力,在MATH、HumanEval⁺和MMLU-Pro等多个任务上实现性能提升;(ii)当基础模型与预热模型在相同小规模数据集(≤100样本)上接受RLVR训练时,预热模型始终优于基础模型;(iii)RLVR训练前进行预热能使模型在特定领域训练后仍保持跨领域泛化能力;(iv)引入预热流程不仅能提升准确率,还能提高RLVR训练的整体样本效率。本文结果凸显了预热策略在数据稀缺环境下构建鲁棒推理LLM的潜力。
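For readers unfamiliar with the toy domain, a Knights & Knaves puzzle can be checked by brute force: knights always tell the truth, knaves always lie, so an assignment is consistent when every speaker's claim is true exactly if that speaker is a knight. The sketch below only illustrates the puzzle family; the `solve` helper and the sample puzzle are hypothetical and are not taken from the paper's dataset or training pipeline.

```python
from itertools import product


def solve(names, statements):
    """Return every knight/knave assignment consistent with the statements.

    `statements` maps a speaker to a function that takes a candidate world
    (dict name -> True if knight) and returns the truth value of the claim.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A knight's claim must be true; a knave's claim must be false.
        if all(world[speaker] == claim(world) for speaker, claim in statements.items()):
            solutions.append(world)
    return solutions


# Sample puzzle: A says "At least one of us is a knave." B says nothing.
puzzle = {"A": lambda w: (not w["A"]) or (not w["B"])}
solutions = solve(["A", "B"], puzzle)
```

Here the only consistent world has A as a knight and B as a knave: if A were a knave the claim would be true, which a knave cannot utter.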


Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale

Abstract

arXiv:2505.13511v1 Announce Type: new Abstract: This study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development. This work presents a new benchmark that evaluates LLMs on freelance programming and data analysis tasks derived from economic data. We construct the benchmark using synthetic tasks created from a Kaggle Freelancer dataset of job postings, with all job prices standardized to USD (median fixed-project price around $250, and an average of $306). Each task is accompanied by structured input-output test cases and an estimated price tag, enabling automated correctness checking and a monetary performance valuation. This approach is inspired by OpenAI's recent SWE-Lancer benchmark (1,400 real Upwork tasks worth $1M total). Still, our framework simplifies evaluation using programmatically testable tasks and predicted price values, making it highly scalable and repeatable. On this benchmark, we evaluate four modern LLMs - Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral. We report each model's accuracy (task success rate and test-case pass rate) and the total "freelance earnings" it achieves (sum of prices of solved tasks). Our results show that Claude 3.5 Haiku performs best, earning approximately $1.52 million USD, followed closely by GPT-4o-mini at $1.49 million, then Qwen 2.5 ($1.33M) and Mistral ($0.70M). We analyze the distribution of errors per task and observe that the strongest models solve the most tasks and rarely fail completely on any project. We discuss the implications of these results for the feasibility of AI as a freelance developer, the advantages and limitations of our automated benchmark approach, and the gap between performance on structured tasks versus the true complexity of real-world freelance jobs.

摘要

本研究探讨了将大语言模型(LLMs)作为自主智能体应用于现实世界任务(包括自由职业软件开发)的可行性。我们提出了一种新型基准测试,通过基于经济数据衍生的自由职业编程与数据分析任务来评估LLMs性能。该基准采用Kaggle自由职业者数据集中的职位发布信息构建合成任务,所有任务价格均标准化为美元计价(固定项目价格中位数约250美元,平均306美元)。每个任务均配备结构化输入-输出测试用例及预估价格标签,支持自动化正确性检验与货币化绩效评估。该方法受OpenAI近期SWE-Lancer基准(包含总价值100万美元的1,400个真实Upwork任务)启发,但本框架通过程序化可测试任务和预测价格值简化了评估流程,使其具备高度可扩展性和可重复性。我们在该基准上评估了四种现代LLMs——Claude 3.5 Haiku、GPT-4o-mini、Qwen 2.5和Mistral,报告了各模型的准确率(任务成功率和测试用例通过率)及实现的'自由职业收入'(已完成任务价格总和)。结果显示Claude 3.5 Haiku表现最佳,收入约152万美元,GPT-4o-mini以149万美元紧随其后,其次是Qwen 2.5(133万美元)和Mistral(70万美元)。我们分析了每项任务的错误分布,发现最强模型不仅能解决最多任务,且极少在项目中完全失败。最后讨论了这些结果对AI作为自由职业开发者可行性的启示、自动化基准方法的优势与局限,以及结构化任务表现与现实自由职业工作真实复杂性之间的差距。
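The three reported quantities (task success rate, test-case pass rate, and "freelance earnings" as the sum of prices of solved tasks) can be computed as in the sketch below. The `TaskResult` record and `summarize` function are hypothetical names, and counting a task as solved only when every test case passes is an assumption; the abstract does not state the exact criterion.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    price_usd: float   # estimated price tag attached to the task
    tests_passed: int  # number of test cases the model's solution passed
    tests_total: int   # number of test cases for the task

    @property
    def solved(self) -> bool:
        # Assumption: a task is solved only if all of its test cases pass.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def summarize(results):
    """Aggregate per-task results into the three benchmark-style metrics."""
    if not results:
        raise ValueError("no task results to summarize")
    total_tests = sum(r.tests_total for r in results)
    if total_tests == 0:
        raise ValueError("no test cases recorded")
    solved = [r for r in results if r.solved]
    return {
        "task_success_rate": len(solved) / len(results),
        "test_pass_rate": sum(r.tests_passed for r in results) / total_tests,
        "earnings_usd": sum(r.price_usd for r in solved),
    }
```

Under this definition a partially correct solution raises the test-case pass rate but earns nothing, which is why the two accuracy figures and the earnings total can rank models differently.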


Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

Abstract

arXiv:2505.13763v1 Announce Type: new Abstract: Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, but they can also fail to do so. This suggests some degree of metacognition -- the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognitive abilities enhance AI capabilities but raise safety concerns, as models might obscure their internal processes to evade neural-activation-based oversight mechanisms designed to detect harmful behaviors. Given society's increased reliance on these models, it is critical that we understand the limits of their metacognitive abilities, particularly their ability to monitor their internal activations. To address this, we introduce a neuroscience-inspired neurofeedback paradigm designed to quantify the ability of LLMs to explicitly report and control their activation patterns. By presenting models with sentence-label pairs where labels correspond to sentence-elicited internal activations along specific directions in the neural representation space, we demonstrate that LLMs can learn to report and control these activations. The performance varies with several factors: the number of example pairs provided, the semantic interpretability of the target neural direction, and the variance explained by that direction. These results reveal a "metacognitive space" with dimensionality much lower than the model's neural space, suggesting LLMs can monitor only a subset of their neural mechanisms. Our findings provide empirical evidence quantifying metacognitive capabilities in LLMs, with significant implications for AI safety.

摘要

大型语言模型(LLMs)有时能报告其解决任务时实际采用的策略,但也可能无法做到这一点。这表明其具备一定程度的元认知能力——即监控自身认知过程以进行后续报告和自我控制的能力。元认知能力虽能增强AI性能,却引发安全隐忧:模型可能通过隐藏内部过程来逃避基于神经激活的有害行为检测机制。随着社会对这些模型的依赖日益加深,理解其元认知能力的边界(特别是监控内部激活的能力)至关重要。为此,我们提出一种神经科学启发的神经反馈范式,用于量化LLMs显式报告和控制其激活模式的能力。通过向模型输入句子-标签对(标签对应句子在神经表征空间特定方向上引发的内部激活),我们证明LLMs能够学会报告并控制这些激活。其表现受以下因素影响:提供的示例对数、目标神经方向的语义可解释性,以及该方向解释的方差。这些结果揭示了一个维度远低于模型神经空间的"元认知空间",表明LLMs仅能监控其神经机制的子集。本研究为量化LLMs元认知能力提供了实证依据,对AI安全具有重要启示。


Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

Abstract

arXiv:2505.13770v1 Announce Type: new Abstract: Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs' responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the effectiveness of this judge by comparing its scoring with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.

摘要

可靠因果推断对于医学、经济学和公共政策等高风险领域的决策至关重要。然而,目前尚不清楚大语言模型(LLMs)能否进行严谨且可信的统计因果推断。现有基准测试通常包含简化任务,例如仅要求LLMs识别语义因果关系或直接从原始数据得出结论。这可能导致模型忽视重要的统计陷阱,如辛普森悖论或选择偏差,从而限制LLMs在现实世界的适用性。为解决这些局限,我们提出CausalPitfalls——一个旨在严格评估LLMs克服常见因果推断陷阱能力的综合基准。该基准具有跨多难度层级的结构化挑战,每个层级均配备评分标准,使我们能定量测量因果推理能力及模型回答的可靠性。我们采用两种协议评估模型:(1)直接提示法,用于评估内在因果推理能力;(2)代码辅助提示法,要求模型生成可执行代码进行显式统计分析。此外,通过将自动评分与人类专家评估对比,我们验证了该评判体系的有效性。研究结果揭示了当前LLMs在统计因果推断方面存在显著局限性。CausalPitfalls基准为推进可信因果推理系统的发展提供了关键指导与量化指标。
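
The abstract names Simpson's paradox as one of the statistical pitfalls. The sketch below shows what that pitfall looks like numerically. The counts are illustrative, patterned on the classic kidney-stone teaching example, and are not taken from the CausalPitfalls benchmark; the names `data`, `rate`, and `pooled_rate` are hypothetical.

```python
# Illustrative (recovered, total) counts per severity stratum and study arm.
# Assumption: these numbers are for demonstration only, not benchmark data.
data = {
    "mild":   {"treated": (81, 87),   "control": (234, 270)},
    "severe": {"treated": (192, 263), "control": (55, 80)},
}

def rate(recovered, total):
    """Recovery rate within a single stratum."""
    return recovered / total

def pooled_rate(arm):
    """Recovery rate for one arm after ignoring the stratum variable."""
    recovered = sum(data[group][arm][0] for group in data)
    total = sum(data[group][arm][1] for group in data)
    return recovered / total

for group in data:
    treated = rate(*data[group]["treated"])
    control = rate(*data[group]["control"])
    print(f"{group:>6}: treated={treated:.3f} control={control:.3f}")

print(f"pooled: treated={pooled_rate('treated'):.3f} "
      f"control={pooled_rate('control'):.3f}")
```

Within each stratum the treated arm has the higher recovery rate, yet the pooled comparison reverses because severity is unevenly distributed across arms. A model that draws conclusions directly from the pooled table would fall into exactly this kind of trap.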


TelePlanNet: An AI-Driven Framework for Efficient Telecom Network Planning

Abstract

arXiv:2505.13831v1 Announce Type: new Abstract: The selection of base station sites is a critical challenge in 5G network planning, which requires efficient optimization of coverage, cost, user satisfaction, and practical constraints. Traditional manual methods, reliant on human expertise, are inefficient and yield unsatisfactory planning-construction consistency. Existing AI tools, despite improving efficiency in certain aspects, still struggle to meet the dynamic network conditions and multi-objective needs of telecom operators' networks. To address these challenges, we propose TelePlanNet, an AI-driven framework tailored for the selection of base station sites, integrating a three-layer architecture for efficient planning and large-scale automation. By leveraging large language models (LLMs) for real-time user input processing and intent alignment with base station planning, combined with training the planning model using the improved group relative policy optimization (GRPO) reinforcement learning, the proposed TelePlanNet can effectively address multi-objective optimization, evaluate candidate sites, and deliver practical solutions. Experimental results show that the proposed TelePlanNet can improve the consistency to 78%, which is superior to the manual methods, providing telecom operators with an efficient and scalable tool that significantly advances cellular network planning.

摘要

基站选址是5G网络规划中的关键挑战,需要高效优化覆盖范围、成本、用户满意度及实际约束条件。传统人工方法依赖专家经验,存在效率低下且规划与建设一致性不足的缺陷。现有AI工具虽在某些方面提升了效率,但仍难以满足电信运营商网络动态条件与多目标需求。为此,我们提出TelePlanNet——一个专为基站选址设计的AI驱动框架,通过三层架构实现高效规划与大规模自动化。该框架利用大语言模型(LLMs)实时处理用户输入并实现基站规划意图对齐,结合改进的群体相对策略优化(GRPO)强化学习训练规划模型,能有效处理多目标优化问题、评估候选站点并提供实用解决方案。实验结果表明,TelePlanNet将规划一致性提升至78%,优于人工方法,为电信运营商提供了显著提升蜂窝网络规划效率的可扩展工具。


CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs

Abstract

arXiv:2505.13778v1 Announce Type: new Abstract: As post-training techniques evolve, large language models (LLMs) are increasingly augmented with structured multi-step reasoning abilities, often optimized through reinforcement learning. These reasoning-enhanced models outperform standard LLMs on complex tasks and now underpin many commercial LLM APIs. However, to protect proprietary behavior and reduce verbosity, providers typically conceal the reasoning traces while returning only the final answer. This opacity introduces a critical transparency gap: users are billed for invisible reasoning tokens, which often account for the majority of the cost, yet have no means to verify their authenticity. This opens the door to token count inflation, where providers may overreport token usage or inject synthetic, low-effort tokens to inflate charges. To address this issue, we propose CoIn, a verification framework that audits both the quantity and semantic validity of hidden tokens. CoIn constructs a verifiable hash tree from token embedding fingerprints to check token counts, and uses embedding-based relevance matching to detect fabricated reasoning content. Experiments demonstrate that CoIn, when deployed as a trusted third-party auditor, can effectively detect token count inflation with a success rate reaching up to 94.7%, showing the strong ability to restore billing transparency in opaque LLM services. The dataset and code are available at https://github.com/CASE-Lab-UMD/LLM-Auditing-CoIn.

摘要

随着训练后优化技术的发展,大型语言模型(LLMs)正日益通过结构化多步推理能力得到增强,这种优化通常通过强化学习实现。具备增强推理能力的模型在复杂任务上表现优于标准LLMs,目前已成为众多商业LLM API的核心组件。然而,为保护专有行为特征并减少冗余输出,服务提供商通常隐藏推理过程痕迹,仅返回最终答案。这种不透明性导致了关键的可信度缺口:用户为不可见的推理标记付费(这些标记往往占据成本的主要部分),却无法验证其真实性。这使得标记数量虚报成为可能——服务商可能夸大上报标记使用量,或注入低质量合成标记以抬高费用。针对该问题,我们提出CoIn验证框架,可同时对隐藏标记的数量和语义有效性进行审计。CoIn通过从标记嵌入指纹构建可验证哈希树来检查标记数量,并采用基于嵌入的相关性匹配检测伪造的推理内容。实验表明,当作为可信第三方审计工具部署时,CoIn能以高达94.7%的成功率有效识别标记虚报行为,展现出恢复不透明LLM服务计费透明性的强大能力。数据集与代码已发布于https://github.com/CASE-Lab-UMD/LLM-Auditing-CoIn。
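
The abstract describes building a verifiable hash tree from token embedding fingerprints. The following is a minimal Merkle-tree sketch of that general idea, not the CoIn implementation: the `fingerprint` function (a SHA-256 hash of rounded embedding values), the duplicate-last-node padding rule, and all function names are assumptions made for illustration.

```python
import hashlib

def fingerprint(embedding):
    """Hypothetical fingerprint: SHA-256 over the rounded embedding values."""
    payload = ",".join(f"{value:.4f}" for value in embedding).encode()
    return hashlib.sha256(payload).digest()

def _parent_level(level):
    """Hash adjacent pairs; an odd-length level repeats its last node."""
    if len(level) % 2 == 1:
        level = level + [level[-1]]
    return [hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)]

def merkle_root(leaves):
    """Root hash committing to every leaf fingerprint."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        level = _parent_level(level)
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes needed to recompute the root from one leaf."""
    proof, level = [], list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _parent_level(level)
        index //= 2
    return proof

def verify(leaf, proof, root):
    """Check that a revealed leaf is consistent with the committed root."""
    node = leaf
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = hashlib.sha256(pair).digest()
    return node == root

# Toy hidden "reasoning tokens": five 3-dimensional embeddings.
embeddings = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(5)]
leaves = [fingerprint(e) for e in embeddings]
commitment = (len(leaves), merkle_root(leaves))  # claimed count plus root
```

In this sketch the provider would publish `commitment`, and an auditor would spot-check a few indices by requesting the leaf and its proof. Because padding by duplication does not by itself bind the number of leaves, the claimed count is stated alongside the root here; a real audit protocol would need a more careful design than this.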


LLM-based Evaluation Policy Extraction for Ecological Modeling

Abstract

arXiv:2505.13794v1 Announce Type: new Abstract: Evaluating ecological time series is critical for benchmarking model performance in many important applications, including predicting greenhouse gas fluxes, capturing carbon-nitrogen dynamics, and monitoring hydrological cycles. Traditional numerical metrics (e.g., R-squared, root mean square error) have been widely used to quantify the similarity between modeled and observed ecosystem variables, but they often fail to capture domain-specific temporal patterns critical to ecological processes. As a result, these methods are often accompanied by expert visual inspection, which requires substantial human labor and limits the applicability to large-scale evaluation. To address these challenges, we propose a novel framework that integrates metric learning with large language model (LLM)-based natural language policy extraction to develop interpretable evaluation criteria. The proposed method processes pairwise annotations and implements a policy optimization mechanism to generate and combine different assessment metrics. The results obtained on multiple datasets for evaluating the predictions of crop gross primary production and carbon dioxide flux have confirmed the effectiveness of the proposed method in capturing target assessment preferences, including both synthetically generated and expert-annotated model comparisons. The proposed framework bridges the gap between numerical metrics and expert knowledge while providing interpretable evaluation policies that accommodate the diverse needs of different ecosystem modeling studies.

摘要

评估生态时间序列对于许多重要应用中的模型性能基准测试至关重要,包括温室气体通量预测、碳氮动态捕捉以及水文循环监测。传统数值指标(如R平方、均方根误差)已被广泛用于量化模型与观测生态系统变量之间的相似性,但这些指标往往无法捕捉对生态过程至关重要的领域特定时间模式。因此,这些方法通常需要辅以专家视觉检查,这不仅耗费大量人力,也限制了大范围评估的适用性。为解决这些问题,我们提出了一种新颖框架,该框架将度量学习与基于大语言模型(LLM)的自然语言策略提取相结合,以制定可解释的评估标准。所提出的方法处理成对标注数据,并实施策略优化机制来生成和组合不同的评估指标。在多个数据集上对作物总初级生产力和二氧化碳通量预测进行评估的结果证实,该方法能有效捕捉目标评估偏好,包括合成生成和专家标注的模型比较。所提出的框架弥合了数值指标与专家知识之间的差距,同时提供了可解释的评估策略,以适应不同生态系统建模研究的多样化需求。


MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

Abstract

arXiv:2505.13941v1 Announce Type: new Abstract: Existing AutoML systems have advanced the automation of machine learning (ML); however, they still require substantial manual configuration and expert input, particularly when handling multimodal data. We introduce MLZero, a novel multi-agent framework powered by Large Language Models (LLMs) that enables end-to-end ML automation across diverse data modalities with minimal human intervention. A cognitive perception module is first employed, transforming raw multimodal inputs into perceptual context that effectively guides the subsequent workflow. To address key limitations of LLMs, such as hallucinated code generation and outdated API knowledge, we enhance the iterative code generation process with semantic and episodic memory. MLZero demonstrates superior performance on MLE-Bench Lite, outperforming all competitors in both success rate and solution quality, securing six gold medals. Additionally, when evaluated on our Multimodal AutoML Agent Benchmark, which includes 25 more challenging tasks spanning diverse data modalities, MLZero outperforms the competing methods by a large margin with a success rate of 0.92 (+263.6%) and an average rank of 2.28. Our approach maintains its robust effectiveness even with a compact 8B LLM, outperforming full-size systems from existing solutions.

摘要

现有AutoML系统虽已推动机器学习(ML)自动化进程,但在处理多模态数据时仍依赖大量人工配置与专家输入。本文提出MLZero——一种基于大语言模型(LLMs)驱动的新型多智能体框架,可在最小人为干预下实现跨多模态数据的端到端ML自动化。该框架首先采用认知感知模块,将原始多模态输入转化为能有效指导后续工作流的感知上下文。针对LLMs存在的代码生成幻觉和API知识陈旧等关键缺陷,我们通过语义记忆与情景记忆增强迭代式代码生成过程。MLZero在MLE-Bench Lite基准测试中表现卓越,以六项金牌的成绩在成功率和解决方案质量上全面超越竞争对手。此外,在我们构建的多模态AutoML智能体基准测试(包含25项跨数据模态的更高难度任务)中,MLZero以0.92的成功率(+263.6%)和2.28的平均排名大幅领先其他方法。即使采用紧凑型8B参数的LLM,本方案仍保持强劲性能,其表现优于现有解决方案中的全尺寸系统。


Visual Instruction Bottleneck Tuning

Abstract

arXiv:2505.13946v1 Announce Type: new Abstract: Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods to improve MLLM generalization typically require either more instruction data or larger advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alternative approach to enhance the robustness of MLLMs under distribution shifts, from a representation learning perspective. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle). We then provide a theoretical justification of Vittle by revealing its connection to an information-theoretic robustness metric of MLLM. Empirical validation of three MLLMs on open-ended and closed-form question answering and object hallucination detection tasks over 45 datasets, including 30 shift scenarios, demonstrates that Vittle consistently improves the MLLM's robustness under shifts by pursuing the learning of a minimal sufficient representation.

摘要

尽管多模态大语言模型(MLLMs)已被广泛采用,但其在分布偏移下遭遇陌生查询时会出现性能下降。现有提升MLLM泛化能力的方法通常需要更多指令数据或更庞大的先进模型架构,这些都会带来显著的人力或计算成本。本研究从表征学习的角度出发,采用了一种增强MLLM在分布偏移下鲁棒性的替代方案。受信息瓶颈(IB)原理启发,我们推导出MLLM的IB变分下界,并设计出实用实现方案——视觉指令瓶颈调优(Vittle)。通过揭示Vittle与MLLM信息论鲁棒性指标的关联,我们为其提供了理论依据。在45个数据集(含30个偏移场景)上对三种MLLM进行的开放式/封闭式问答及物体幻觉检测任务的实证验证表明,Vittle通过追求最小充分表征学习,能持续提升MLLM在分布偏移下的鲁棒性。
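
For readers unfamiliar with the information bottleneck, the generic objective and its standard variational bound can be written as below. This is the textbook form, given as background under the assumption that input $X$, target $Y$, and latent representation $Z$ play their usual roles; it is not the MLLM-specific bound derived in the paper, and the symbols $q$, $r$, and $\beta$ are the conventional ones rather than the authors' notation.

```latex
\begin{align}
\mathcal{L}_{\mathrm{IB}} &= I(Z; Y) - \beta \, I(Z; X) \\
&\ge \mathbb{E}_{p(x, y)} \, \mathbb{E}_{p(z \mid x)} \big[ \log q(y \mid z) \big] + H(Y)
   - \beta \, \mathbb{E}_{p(x)} \Big[ D_{\mathrm{KL}} \big( p(z \mid x) \,\|\, r(z) \big) \Big]
\end{align}
```

The first term rewards representations that remain predictive of the target, while the KL term penalizes information retained about the input; $H(Y)$ is constant with respect to the model and can be dropped during optimization.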


Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning

Abstract

arXiv:2505.13994v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) systems empower large language models (LLMs) with external knowledge, yet struggle with efficiency-accuracy trade-offs when scaling to large knowledge graphs. Existing approaches often rely on monolithic graph retrieval, incurring unnecessary latency for simple queries and fragmented reasoning for complex multi-hop questions. To address these challenges, this paper proposes SPLIT-RAG, a multi-agent RAG framework built on question-driven semantic graph partitioning and collaborative subgraph retrieval. The framework first creates a Semantic Partitioning of Linked Information, then uses the Type-Specialized knowledge base to achieve Multi-Agent RAG. The attribute-aware graph segmentation divides knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types, while lightweight LLM agents are assigned to partitioned subgraphs, and only relevant partitions are activated during retrieval, thus reducing the search space while enhancing efficiency. Finally, a hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications. Extensive experimental validation demonstrates considerable improvements compared to existing approaches.

摘要

检索增强生成(RAG)系统通过外部知识赋能大语言模型(LLM),但在扩展至大规模知识图谱时面临效率与准确性的权衡难题。现有方法通常依赖整体图检索,导致简单查询产生不必要延迟,而复杂多跳问题则引发碎片化推理。针对这些挑战,本文提出SPLIT-RAG——一种多智能体RAG框架,通过问题驱动的语义图分区与协作式子图检索解决上述局限。该创新框架首先创建链接信息的语义分区(SPLI),随后利用类型专精知识库实现多智能体RAG。基于属性的图分割机制将知识图谱划分为语义连贯的子图,确保子图与不同查询类型对齐;同时为分区子图分配轻量级LLM智能体,仅在检索时激活相关分区,从而在提升效率的同时缩减搜索空间。最终,层级融合模块通过逻辑验证消除子图派生答案间的矛盾。大量实验验证表明,相较现有方法,本框架实现了显著性能提升。


DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery

Abstract

arXiv:2505.13940v1 Announce Type: new Abstract: In the field of AI4Science, large-scale language models (LLMs) show great potential to parse complex scientific semantics, integrate cross-disciplinary knowledge, and assist critical task research. However, in the field of drug discovery, despite optimization through professional data pre-training, context window expansion, and internet search, existing LLMs still face challenges such as processing massive multi-modal and heterogeneous data, delays in dynamically updating domain knowledge, and insufficient confidence in predicting the results of complex computational tasks. To address these challenges, we propose DrugPilot, an LLM-based agent with parameterized reasoning for drug discovery. DrugPilot addresses key limitations of traditional end-to-end LLM prediction approaches through its parametric inference architecture. This agent system supports major phases of the drug discovery pipeline, facilitating automated planning and execution of multi-stage research tasks. To address the critical challenge of multi-modal drug data analysis (incorporating both public datasets and user-submitted data), we developed an interactive parameterized memory pool. This innovative component standardizes real-world drug data into parametric representations, simultaneously enabling efficient knowledge retrieval in multi-turn dialogue while mitigating the information loss inherent in text-based data transmission. Additionally, we created a drug instruct dataset across 8 essential drug discovery tasks for model fine-tuning and evaluation. Based on the Berkeley function calling evaluation framework, DrugPilot demonstrated the most advanced tool calling capabilities on our drug discovery tool instruction dataset, outperforming existing agents (e.g., ReAct, LoT). Specifically, it achieves task completion rates of 98.0%, 93.5%, and 64.0% on simple, multiple, and multi-turn tasks, respectively.

摘要

在AI4Science领域,大规模语言模型(LLMs)展现出解析复杂科学语义、整合跨学科知识以及辅助关键任务研究的巨大潜力。然而在药物发现领域,尽管通过专业数据预训练、上下文窗口扩展和网络搜索进行了优化,现有LLMs仍面临多模态异构数据处理海量化、领域知识动态更新延迟以及对复杂计算任务结果预测置信度不足等挑战。针对这些问题,我们提出了DrugPilot——一种基于LLM的参数化推理药物发现智能体。该智能体通过参数化推理架构解决了传统端到端LLM预测方法的关键局限,支持药物发现流程的主要阶段,实现多阶段研究任务的自动化规划与执行。为应对多模态药物数据分析(整合公共数据集与用户提交数据)这一核心挑战,我们开发了交互式参数化记忆池。这一创新组件将真实世界药物数据标准化为参数化表征,既能实现多轮对话中的高效知识检索,又可缓解文本数据传输固有的信息损失。此外,我们构建了涵盖8项关键药物发现任务的药物指令数据集用于模型微调与评估。基于伯克利函数调用评估框架,DrugPilot在我们的药物发现工具指令数据集上展现出最先进的工具调用能力,优于现有智能体(如ReAct、LoT),在简单、多重和多轮任务中分别达到98.0%、93.5%和64.0%的任务完成率。


ProMind-LLM: Proactive Mental Health Care via Causal Reasoning with Sensor Data

Abstract

arXiv:2505.14038v1 Announce Type: new Abstract: Mental health risk is a critical global public health challenge, necessitating innovative and reliable assessment methods. Large language models (LLMs) have emerged as a promising tool for explainable mental health care applications. Nevertheless, existing approaches predominantly rely on subjective textual mental records, which can be distorted by inherent mental uncertainties, leading to inconsistent and unreliable predictions. To address these limitations, this paper introduces ProMind-LLM. We investigate an innovative approach integrating objective behavior data as complementary information alongside subjective mental records for robust mental health risk assessment. Specifically, ProMind-LLM incorporates a comprehensive pipeline that includes domain-specific pretraining to tailor the LLM for mental health contexts, a self-refine mechanism to optimize the processing of numerical behavioral data, and causal chain-of-thought reasoning to enhance the reliability and interpretability of its predictions. Evaluations on two real-world datasets, PMData and Globem, demonstrate the effectiveness of our proposed methods, achieving substantial improvements over general LLMs. We anticipate that ProMind-LLM will pave the way for more dependable, interpretable, and scalable mental health care solutions.

摘要

心理健康风险是全球公共卫生面临的重大挑战,亟需创新且可靠的评估方法。随着大语言模型(LLMs)的发展,其有望成为可解释心理健康应用的有力工具。然而现有方法主要依赖主观文本心理记录,这些记录可能因内在心理不确定性而产生偏差,导致预测结果不一致且不可靠。为突破这些局限,本文提出ProMind-LLM框架,探索将客观行为数据作为主观心理记录的补充信息,以实现稳健的心理健康风险评估。具体而言,ProMind-LLM构建了包含三大创新模块的完整流程:针对心理健康领域进行领域适配预训练、通过自优化机制处理数值化行为数据、采用因果思维链推理增强预测可靠性与可解释性。在PMData和Globem两个真实数据集上的评估表明,该方法显著优于通用大语言模型。我们预期ProMind-LLM将为构建更可靠、可解释且可扩展的心理健康解决方案开辟新途径。


s3: You Don't Need That Much Data to Train a Search Agent via RL

Abstract

arXiv:2505.14146v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility or fine-tune the entire LLM to jointly reason and retrieve-entangling retrieval with generation and limiting the real search utility and compatibility with frozen or proprietary models. In this work, we propose s3, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naive RAG. s3 requires only 2.4k training samples to outperform baselines trained on over 70x more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.

摘要

检索增强生成(RAG)系统使大型语言模型(LLM)能够在推理过程中访问外部知识。近期研究通过强化学习(RL)让LLM作为搜索代理,借助与检索引擎的多轮交互提升信息获取能力。然而现有方法要么采用忽略下游效用的纯搜索指标(如NDCG)优化检索,要么对整个LLM进行微调以联合执行推理与检索——这种将检索与生成耦合的方式限制了实际搜索效用及与冻结/专有模型的兼容性。本研究提出s3框架:该轻量级、模型无关的方案将搜索器与生成器解耦,并通过"超越RAG的增益"(即相对基础RAG的生成准确率提升)训练搜索器。s3仅需2.4k训练样本即可超越使用70倍以上数据训练的基线模型,在六个通用QA和五个医疗QA基准测试中持续展现出更强的下游性能。
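
The Gain Beyond RAG reward is described as the improvement in generation accuracy over naive RAG. A minimal sketch of that difference follows. Exact match stands in for whatever accuracy measure the authors use, and `toy_generator`, `exact_match`, and `gain_beyond_rag` are hypothetical names rather than the s3 API.

```python
def exact_match(prediction, gold):
    """Stand-in accuracy metric (assumption): case-insensitive exact match."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0

def gain_beyond_rag(generate, question, gold, searcher_docs, naive_docs):
    """Reward = accuracy with the searcher's documents
    minus accuracy with naive RAG documents, using the same frozen generator."""
    score_with_searcher = exact_match(generate(question, searcher_docs), gold)
    score_with_naive_rag = exact_match(generate(question, naive_docs), gold)
    return score_with_searcher - score_with_naive_rag

def toy_generator(question, docs):
    """Frozen-generator stand-in: answers correctly only if a document helps."""
    return "Paris" if any("Paris" in doc for doc in docs) else "unknown"

reward = gain_beyond_rag(
    toy_generator,
    question="What is the capital of France?",
    gold="Paris",
    searcher_docs=["Paris is the capital of France."],
    naive_docs=["France borders Spain."],
)
print(reward)
```

Because the generator is only called, never updated, a reward of this shape can be computed against frozen or proprietary models, which matches the decoupling the abstract emphasizes.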


MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem

Abstract

arXiv:2505.14148v1 Announce Type: new Abstract: Mathematical modeling is a cornerstone of scientific discovery and engineering practice, enabling the translation of real-world problems into formal systems across domains such as physics, biology, and economics. Unlike mathematical reasoning, which assumes a predefined formulation, modeling requires open-ended problem analysis, abstraction, and principled formalization. While Large Language Models (LLMs) have shown strong reasoning capabilities, they fall short in rigorous model construction, limiting their utility in real-world problem-solving. To this end, we formalize the task of LLM-powered real-world mathematical modeling, where agents must analyze problems, construct domain-appropriate formulations, and generate complete end-to-end solutions. We introduce MM-Bench, a curated benchmark of 111 problems from the Mathematical Contest in Modeling (MCM/ICM), spanning the years 2000 to 2025 and across ten diverse domains such as physics, biology, and economics. To tackle this task, we propose MM-Agent, an expert-inspired framework that decomposes mathematical modeling into four stages: open-ended problem analysis, structured model formulation, computational problem solving, and report generation. Experiments on MM-Bench show that MM-Agent significantly outperforms baseline agents, achieving an 11.88% improvement over human expert solutions while requiring only 15 minutes and $0.88 per task using GPT-4o. Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (top 2.0% among 27,456 teams) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot. Our code is available at https://github.com/usail-hkust/LLM-MM-Agent


DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation

Abstract

arXiv:2505.14163v1 Announce Type: new Abstract: Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. Recent studies primarily focus on enhancing in-context learning through improved search, sampling, and planning techniques, while overlooking the importance of the order in which problems are tackled during inference. In this work, we develop a novel inference-time optimization framework, referred to as DSMentor, which leverages curriculum learning -- a strategy that introduces simpler tasks first and progressively moves to more complex ones as the learner improves -- to enhance LLM agent performance in challenging data science tasks. Our mentor-guided framework organizes data science tasks in order of increasing difficulty and incorporates a growing long-term memory to retain prior experiences, guiding the agent's learning progression and enabling more effective utilization of accumulated knowledge. We evaluate DSMentor through extensive experiments on DSEval and QRData benchmarks. Experiments show that DSMentor using Claude-3.5-Sonnet improves the pass rate by up to 5.2% on DSEval and QRData compared to baseline agents. Furthermore, DSMentor demonstrates stronger causal reasoning ability, improving the pass rate by 8.8% on the causality problems compared to GPT-4 using Program-of-Thoughts prompts. Our work underscores the importance of developing effective strategies for accumulating and utilizing knowledge during inference, mirroring the human learning process and opening new avenues for improving LLM performance through curriculum-based inference optimization.

摘要

大语言模型(LLM)代理在生成代码解决复杂数据科学问题方面展现出卓越性能。当前研究主要集中于通过改进搜索、采样和规划技术来增强上下文学习,却忽视了推理过程中问题解决顺序的重要性。本研究提出了一种新颖的推理时优化框架DSMentor,其采用课程学习策略——即先引入简单任务,随着学习者能力提升逐步过渡到复杂任务——以提升LLM代理在挑战性数据科学任务中的表现。该导师引导框架将数据科学任务按难度递增排序,并通过不断增长的长期记忆保留先前经验,从而指导代理的学习进程并更有效地利用累积知识。我们在DSEval和QRData基准上进行了大量实验评估,结果表明:相较于基线代理,采用Claude-3.5-Sonnet的DSMentor在两项基准上的通过率最高可提升5.2%。此外,DSMentor展现出更强的因果推理能力,在因果关系问题上相较采用思维程序提示的GPT-4实现了8.8%的通过率提升。本工作揭示了在推理过程中开发有效知识积累与运用策略的重要性,这种策略模拟了人类学习过程,为通过课程化推理优化提升LLM性能开辟了新途径。
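
The curriculum-plus-memory loop described in the abstract can be sketched as follows. This is a simplified illustration, not the DSMentor implementation: how difficulty is scored and how past experiences are retrieved are not specified in the abstract, so the sketch assumes a caller-supplied `difficulty` function and a plain "most recent entries" retrieval rule. `run_curriculum` and `toy_solve` are hypothetical names.

```python
def run_curriculum(tasks, difficulty, solve, memory_size=3):
    """Solve tasks from easiest to hardest while accumulating a memory.

    difficulty: callable mapping a task to a sortable score (assumption).
    solve: callable (task, examples) -> answer; stands in for the LLM agent.
    memory_size: how many recent (task, answer) pairs are shown to the agent.
    """
    memory = []   # grows over the whole run (long-term memory)
    results = {}
    for task in sorted(tasks, key=difficulty):
        examples = memory[-memory_size:]   # simplistic recency-based retrieval
        answer = solve(task, examples)
        results[task] = answer
        memory.append((task, answer))
    return results, memory

def toy_solve(task, examples):
    """Agent stand-in that records how many prior examples it was given."""
    return f"{task}:{len(examples)}"

levels = {"easy": 1, "medium": 2, "hard": 3}
results, memory = run_curriculum(
    ["hard", "easy", "medium"], difficulty=levels.get,
    solve=toy_solve, memory_size=2,
)
print([task for task, _ in memory])
```

The point of the ordering is that by the time the hardest task is attempted, the memory already holds solved easier tasks to draw on.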


RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning

Abstract

arXiv:2505.14140v1 Announce Type: new Abstract: Despite rapid advancements in large language models (LLMs), the token-level autoregressive nature constrains their complex reasoning capabilities. To enhance LLM reasoning, inference-time techniques, including Chain/Tree/Graph-of-Thought(s), successfully improve the performance, as they are fairly cost-effective by guiding reasoning through sophisticated logical structures without modifying LLMs' parameters. However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks, lacking adaptability. To improve this, we propose RL-of-Thoughts (RLoT), where we train a lightweight navigator model with reinforcement learning (RL) to adaptively enhance LLM reasoning at inference time. Specifically, we design five basic logic blocks from the perspective of human cognition. During the reasoning process, the trained RL navigator dynamically selects the suitable logic blocks and combines them into task-specific logical structures according to problem characteristics. Experiments across multiple reasoning benchmarks (AIME, MATH, GPQA, etc.) with multiple LLMs (GPT, Llama, Qwen, and DeepSeek) illustrate that RLoT outperforms established inference-time techniques by up to 13.4%. Remarkably, with less than 3K parameters, our RL navigator is able to make sub-10B LLMs comparable to 100B-scale counterparts. Moreover, the RL navigator demonstrates strong transferability: a model trained on one specific LLM-task pair can effectively generalize to unseen LLMs and tasks. Our code is open-source at https://anonymous.4open.science/r/RL-LLM-Reasoning-1A30 for reproducibility.

摘要

尽管大语言模型(LLMs)发展迅速,但其基于token级自回归的特性限制了复杂推理能力。为增强LLM推理性能,推理时技术(如思维链/树/图)通过构建复杂逻辑结构引导推理而无需修改模型参数,以较低成本显著提升了表现。然而这些人工预定义、任务无关的框架在不同任务中统一应用,缺乏适应性。为此,我们提出强化学习思维框架(RLoT),利用强化学习(RL)训练轻量级导航器模型,在推理时自适应增强LLM推理能力。具体而言,我们从人类认知视角设计了五种基础逻辑模块。推理过程中,训练好的RL导航器根据问题特征动态选择合适逻辑模块,组合成任务特定的逻辑结构。在多个推理基准(AIME、MATH、GPQA等)和多种LLM(GPT、Llama、Qwen及DeepSeek)上的实验表明,RLoT相较现有推理时技术最高可提升13.4%性能。值得注意的是,仅需不足3K参数的RL导航器即可使百亿级以下LLM达到千亿级模型的水平。此外,该导航器展现出强迁移性:针对特定LLM-任务对训练的模型能有效泛化至未见过的LLM和任务。我们的代码已开源在https://anonymous.4open.science/r/RL-LLM-Reasoning-1A30以保证可复现性。


Reinforcement Learning vs. Distillation: Understanding Accuracy and Capability in LLM Reasoning

Abstract

arXiv:2505.14216v1 Announce Type: new Abstract: Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) enhances overall accuracy but fails to improve capability, while distillation can improve both. In this paper, we investigate the mechanisms behind these phenomena. First, we demonstrate that RLVR does not improve capability because it focuses on improving the accuracy of the less-difficult questions to the detriment of the accuracy of the most difficult questions, thereby leading to no improvement in capability. Second, we find that RLVR does not merely increase the success probability for the less difficult questions, but in our small model settings produces quality responses that were absent in its output distribution before training. In addition, we show these responses are neither noticeably longer nor feature more reflection-related keywords, underscoring the need for more reliable indicators of response quality. Third, we show that while distillation reliably improves accuracy by learning strong reasoning patterns, it only improves capability when new knowledge is introduced. Moreover, when distilling only with reasoning patterns and no new knowledge, the accuracy of the less-difficult questions improves to the detriment of the most difficult questions, similar to RLVR. Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in language models.

摘要

近期研究表明,采用可验证奖励的强化学习(RLVR)虽能提升整体准确率,却无法增强模型能力,而蒸馏方法则可同时改善两者。本文深入探讨了这些现象背后的机制。首先,我们证明RLVR未能提升能力的原因在于其专注于提高较简单问题的准确率,却以牺牲最难题目的正确率为代价,从而导致能力未见改善。其次,我们发现RLVR不仅增加了较简单问题的成功概率,在小模型设定下还会产生训练前输出分布中未曾出现的高质量回答。值得注意的是,这些回答既未显著增长篇幅,也未包含更多反思相关关键词,这凸显了寻找更可靠回答质量指标的必要性。第三,我们揭示蒸馏方法虽能通过学习强推理模式稳定提升准确率,但仅当引入新知识时才能增强模型能力。此外,当仅蒸馏推理模式而未引入新知识时,较简单问题的准确率提升同样会导致最难题目的表现下降,这与RLVR的效果相似。这些发现共同为理解RLVR和蒸馏如何影响语言模型的推理行为提供了更清晰的视角。


Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent

Abstract

arXiv:2505.14141v1 Announce Type: new Abstract: Mobile GUI agents execute user commands by directly interacting with the graphical user interface (GUI) of mobile devices, demonstrating significant potential to enhance user convenience. However, these agents face considerable challenges in task planning, as they must continuously analyze the GUI and generate operation instructions step by step. This process often leads to difficulties in making accurate task plans, as GUI agents lack a deep understanding of how to effectively use the target applications, which can cause them to become "lost" during task execution. To address the task planning issue, we propose SPlanner, a plug-and-play planning module to generate execution plans that guide vision language models (VLMs) in executing tasks. The proposed planning module utilizes extended finite state machines (EFSMs) to model the control logic and configurations of mobile applications. It then decomposes a user instruction into a sequence of primary functions modeled in EFSMs and generates the execution path by traversing the EFSMs. We further refine the execution path into a natural language plan using an LLM. The final plan is concise and actionable, and effectively guides VLMs to generate interactive GUI actions to accomplish user tasks. SPlanner demonstrates strong performance on dynamic benchmarks reflecting real-world mobile usage. On the AndroidWorld benchmark, SPlanner achieves a 63.8% task success rate when paired with Qwen2.5-VL-72B as the VLM executor, yielding a 28.8 percentage point improvement compared to using Qwen2.5-VL-72B without planning assistance.

摘要

移动GUI代理通过直接与移动设备的图形用户界面(GUI)交互来执行用户指令,展现出提升用户便利性的巨大潜力。然而这些代理在任务规划方面面临重大挑战,因为它们需要持续分析GUI并逐步生成操作指令。这一过程往往导致任务规划准确性不足,由于GUI代理对目标应用程序的有效使用方式缺乏深入理解,可能在任务执行过程中出现"迷失"现象。为解决任务规划问题,我们提出SPlanner——一个即插即用的规划模块,用于生成执行计划以指导视觉语言模型(VLM)完成任务。该规划模块采用扩展有限状态机(EFSM)对移动应用程序的控制逻辑和配置进行建模,将用户指令分解为EFSM建模的初级功能序列,并通过遍历EFSM生成执行路径。我们进一步使用大语言模型(LLM)将执行路径细化为自然语言计划。最终生成的计划简洁且可操作,能有效指导VLM生成交互式GUI动作以完成用户任务。SPlanner在反映真实移动使用场景的动态基准测试中表现优异。在AndroidWorld基准测试中,当与Qwen2.5-VL-72B作为VLM执行器配合使用时,SPlanner实现了63.8%的任务成功率,相比无规划辅助的Qwen2.5-VL-72B提高了28.8个百分点。
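
Generating an execution path by traversing an extended finite state machine can be illustrated with a small breadth-first search. The sketch below is an assumption-laden toy, not SPlanner's code: the alarm-app states, action labels, guard functions, and the `plan` helper are all hypothetical. What makes the machine "extended" is that transitions carry guards over variables and can update those variables.

```python
from collections import deque

# Each transition: (source state, action label, guard, variable update, target state).
# Hypothetical model of an alarm app, for illustration only.
TRANSITIONS = [
    ("home", "tap 'Alarm' tab", lambda v: True, {}, "alarm_list"),
    ("alarm_list", "tap '+'", lambda v: True, {}, "alarm_edit"),
    ("alarm_edit", "set time", lambda v: True, {"time_set": True}, "alarm_edit"),
    ("alarm_edit", "tap 'Save'", lambda v: v.get("time_set", False), {}, "alarm_saved"),
]

def plan(start, goal, variables=None):
    """Breadth-first search over (state, variables); returns the action list."""
    start_vars = dict(variables or {})
    queue = deque([(start, start_vars, [])])
    seen = {(start, frozenset(start_vars.items()))}
    while queue:
        state, current_vars, path = queue.popleft()
        if state == goal:
            return path
        for source, action, guard, update, target in TRANSITIONS:
            if source != state or not guard(current_vars):
                continue  # transition not available from this configuration
            new_vars = {**current_vars, **update}
            key = (target, frozenset(new_vars.items()))
            if key not in seen:
                seen.add(key)
                queue.append((target, new_vars, path + [action]))
    return None  # goal unreachable

print(plan("home", "alarm_saved"))
```

The guard on the save transition forces the planner to include the "set time" step, which is the kind of application knowledge a step-by-step GUI agent may miss. In the pipeline the abstract describes, a path like this would then be rewritten into a natural language plan by an LLM.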


SHARP: Synthesizing High-quality Aligned Reasoning Problems for Large Reasoning Models Reinforcement Learning

Abstract

arXiv:2505.14147v1 Announce Type: new Abstract: Training large reasoning models (LRMs) with reinforcement learning in STEM domains is hindered by the scarcity of high-quality, diverse, and verifiable problem sets. Existing synthesis methods, such as Chain-of-Thought prompting, often generate oversimplified or uncheckable data, limiting model advancement on complex tasks. To address these challenges, we introduce SHARP, a unified approach to Synthesizing High-quality Aligned Reasoning Problems for LRMs reinforcement learning with verifiable rewards (RLVR). SHARP encompasses a strategic set of self-alignment principles -- targeting graduate and Olympiad-level difficulty, rigorous logical consistency, and unambiguous, verifiable answers -- and a structured three-phase framework (Alignment, Instantiation, Inference) that ensures thematic diversity and fine-grained control over problem generation. We implement SHARP by leveraging a state-of-the-art LRM to infer and verify challenging STEM questions, then employ a reinforcement learning loop to refine the model's reasoning through verifiable reward signals. Experiments on benchmarks such as GPQA demonstrate that SHARP-augmented training substantially outperforms existing methods, markedly improving complex reasoning accuracy and pushing LRM performance closer to expert-level proficiency. Our contributions include the SHARP strategy, framework design, end-to-end implementation, and experimental evaluation of its effectiveness in elevating LRM reasoning capabilities.

摘要

在STEM领域通过强化学习训练大型推理模型(LRM)面临高质量、多样化且可验证问题集稀缺的挑战。现有合成方法(如思维链提示)往往生成过于简化或不可验证的数据,限制了模型在复杂任务上的进步。为解决这些问题,我们提出SHARP方法——一种为LRM强化学习合成高质量对齐推理问题的统一框架,其特点在于可验证奖励机制(RLVR)。SHARP包含一套自对齐策略原则(瞄准研究生和奥赛级难度、严格逻辑一致性及明确可验证答案)和结构化三阶段框架(对齐、实例化、推理),确保问题生成的学科多样性和细粒度控制。我们通过前沿LRM推断并验证具有挑战性的STEM问题实现SHARP,继而采用强化学习循环,通过可验证奖励信号优化模型推理能力。在GPQA等基准测试上的实验表明,SHARP增强训练显著优于现有方法,大幅提升复杂推理准确率,使LRM性能逼近专家水平。本研究的贡献包括SHARP策略、框架设计、端到端实现及其提升LRM推理能力的有效性实验评估。


Empowering LLMs in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy

Abstract

arXiv:2505.14299v1 Announce Type: new Abstract: Task-oriented dialogue systems based on Large Language Models (LLMs) have gained increasing attention across various industries and achieved significant results. Current approaches condense complex procedural workflows into a single agent to achieve satisfactory performance on large-scale LLMs. However, these approaches struggle to achieve comparable performance on fine-tuned lightweight LLMs, due to their limited capabilities in handling multiple complex logic. In this work, we design a Domain-Independent Multi-Agent Framework (DIMF), which contains an Intent Classification Agent, a Slot Filling Agent, and a Response Agent. This approach simplifies the learning complexity and enhances the generalization ability by separating the tasks into domain-independent components. In this framework, we enhance the capabilities in contextual understanding using the Direct Preference Optimisation (DPO) method, and propose a simple and effective Data Distribution Adaptation (DDA) method to mitigate degradation issues during DPO training. Experiments conducted on the MultiWOZ datasets show that our proposed method achieves better average performance than all the baselines. Extensive analysis also demonstrates that our proposed framework exhibits excellent generalizability and zero-shot capability.

摘要

基于大语言模型(LLM)的任务导向对话系统在各行业获得广泛关注并取得显著成果。当前方法将复杂流程工作压缩至单一智能体,从而在大规模LLM上实现满意性能。然而,由于处理多重复杂逻辑的能力有限,这些方法在微调后的轻量级LLM上难以达到可比性能。本研究设计了一个领域无关的多智能体框架(DIMF),包含意图分类智能体、槽填充智能体和响应智能体。该方法通过任务解耦为领域无关组件,降低了学习复杂度并增强泛化能力。在该框架中,我们采用直接偏好优化(DPO)方法提升上下文理解能力,并提出简单有效的数据分布适配(DDA)方法以缓解DPO训练中的性能退化问题。在MultiWOZ数据集上的实验表明,所提方法在所有基线中取得了更优的平均性能。深入分析也证明该框架具有出色的泛化能力和零样本学习能力。


SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation

Abstract

arXiv:2505.14381v1 Announce Type: new Abstract: With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs can achieve better RAG performance, but processing rich documents still remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach enhancing both textual and visual Retrieval-Augmented Generation (RAG) systems working with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering continuous components. We trained the SCAN model by fine-tuning object detection models with sophisticated annotation datasets. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4%, outperforming conventional approaches and even commercial document processing solutions.

摘要

随着大型语言模型(LLMs)和视觉语言模型(VLMs)的广泛应用,面向检索增强生成(RAG)和视觉RAG等应用的富文档分析技术正受到极大关注。近期研究表明,使用VLMs可获得更优的RAG性能,但由于单页文档包含大量信息,富文档处理仍存在挑战。本文提出SCAN(语义文档布局分析),这是一种增强文本与视觉RAG系统处理富文档能力的新方法。该VLM友好型方案通过恰当的语义粒度识别文档组件,在上下文保留与处理效率间取得平衡。SCAN采用粗粒度语义方法,将文档划分为覆盖连续组件的连贯区域。我们通过精标注数据集微调目标检测模型来训练SCAN模型。英日双语数据集的实验结果表明:应用SCAN可使端到端文本RAG性能提升最高达9.0%,视觉RAG性能提升最高达6.4%,其表现优于传统方法乃至商用文档处理方案。


Knowledge Graph Based Repository-Level Code Generation

Abstract

arXiv:2505.14394v1 Announce Type: new Abstract: Recent advancements in Large Language Models (LLMs) have transformed code generation from natural language queries. However, despite their extensive knowledge and ability to produce high-quality code, LLMs often struggle with contextual accuracy, particularly in evolving codebases. Current code search and retrieval methods frequently lack robustness in both the quality and contextual relevance of retrieved results, leading to suboptimal code generation. This paper introduces a novel knowledge graph-based approach to improve code search and retrieval leading to better quality of code generation in the context of repository-level tasks. The proposed approach represents code repositories as graphs, capturing structural and relational information for enhanced context-aware code generation. Our framework employs a hybrid approach for code retrieval to improve contextual relevance, track inter-file modular dependencies, generate more robust code and ensure consistency with the existing codebase. We benchmark the proposed approach on the Evolutionary Code Benchmark (EvoCodeBench) dataset, a repository-level code generation benchmark, and demonstrate that our method significantly outperforms the baseline approach. These findings suggest that knowledge graph based code generation could advance robust, context-sensitive coding assistance tools.

摘要

大语言模型(LLM)的最新进展实现了从自然语言查询到代码生成的变革。然而,尽管LLMs具备丰富的知识并能够生成高质量代码,其在上下文准确性方面仍存在不足,尤其在持续演进的代码库中表现明显。现有代码搜索与检索方法在结果质量和上下文相关性方面往往缺乏鲁棒性,导致生成的代码未能达到最优水平。本文提出一种基于知识图谱的创新方法,旨在通过改进代码搜索与检索机制,提升仓库级任务场景下的代码生成质量。该方法将代码仓库表示为图结构,通过捕捉代码的结构化信息和关联关系来增强上下文感知的代码生成能力。我们设计了一个混合式代码检索框架,通过提升上下文相关性、追踪文件间模块化依赖关系,生成更具鲁棒性的代码,并确保与现有代码库的一致性。在仓库级代码生成基准测试集Evolutionary Code Benchmark(EvoCodeBench)上的实验表明,本方法显著优于基线方法。这些发现证明,基于知识图谱的代码生成技术有望推动具备上下文感知能力的鲁棒性编程辅助工具的发展。


SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors

Abstract

arXiv:2505.14300v1 Announce Type: new Abstract: High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. Similarly, Large Language Models (LLMs) need monitoring safeguards. We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach that treats normal behavior as the baseline and harmful outputs as outliers. Our study focuses specifically on backdoor-triggered responses -- where specific input phrases activate hidden vulnerabilities causing the model to generate unsafe content like violence, pornography, or hate speech. We address two key challenges: (1) identifying true causal indicators rather than surface correlations, and (2) preventing advanced models from deception -- deliberately evading monitoring systems. Hence, we approach this problem from an unsupervised lens by drawing parallels to human deception: just as humans exhibit physical indicators while lying, we investigate whether LLMs display distinct internal behavioral signatures when generating harmful content. Our study addresses two critical challenges: 1) designing monitoring systems that capture true causal indicators rather than superficial correlations; and 2) preventing intentional evasion by increasingly capable "future models". Our findings show that models can produce harmful content through causal mechanisms and can become deceptive by: (a) alternating between linear and non-linear representations, and (b) modifying feature relationships. To counter this, we developed Safety-Net -- a multi-detector framework that monitors different representation dimensions, successfully detecting harmful behavior even when information is shifted across representational spaces to evade individual monitors. Our evaluation shows 96% accuracy in detecting harmful cases using our unsupervised ensemble approach.

摘要

核能与航空等高危行业通过实时监测来识别危险系统状态。类似地,大型语言模型(LLMs)也需要监控保障机制。我们提出一种实时预测框架,采用无监督学习方法,将正常行为作为基线、有害输出视为异常值,从而在有害AI输出发生前进行预判。本研究特别关注后门触发式响应——即特定输入短语会激活模型隐藏漏洞,导致其生成暴力、色情或仇恨言论等不安全内容。我们解决了两大核心挑战:(1) 识别真实因果指标而非表面关联;(2) 防止先进模型的欺骗行为——即故意规避监控系统。为此,我们从无监督角度切入,类比人类欺骗行为:正如人类说谎时会呈现生理指标,我们探究LLMs生成有害内容时是否表现出独特的内部行为特征。研究重点应对两大关键挑战:1) 设计能捕捉真实因果指标而非表面关联的监控系统;2) 防止能力日益增强的"未来模型"实施故意规避。研究发现表明,模型可通过因果机制生成有害内容,并能通过以下方式实施欺骗:(a) 在线性与非线性表征间切换;(b) 改变特征关联。为此我们开发了Safety-Net——一个多维监测框架,通过监控不同表征维度,即使信息在表征空间转移以规避单个监测器时,仍能成功检测有害行为。评估显示,我们的无监督集成方法在有害案例检测中达到96%准确率。


Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning

Abstract

arXiv:2505.14391v1 Announce Type: new Abstract: Many studies focus on data annotation techniques for training effective PRMs. However, current methods encounter a significant issue when applied to long CoT reasoning processes: they tend to focus solely on the first incorrect step and all preceding steps, assuming that all subsequent steps are incorrect. These methods overlook the unique self-correction and reflection mechanisms inherent in long CoT, where correct reasoning steps may still occur after initial reasoning mistakes. To address this issue, we propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process. Given that under the reflection pattern, correct and incorrect steps often alternate, we introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs' ability to identify both effective self-correction behaviors and reasoning based on erroneous steps. Leveraging an LLM-based judger for annotation, we collect 1.7 million data samples to train a 7B PRM and evaluate it at both solution and step levels. Experimental results demonstrate that compared to existing open-source PRMs and PRMs trained on open-source datasets, our PRM achieves superior performance across various metrics, including search guidance, BoN, and F1 scores. Compared to widely used MC-based annotation methods, our annotation approach not only achieves higher data efficiency but also delivers superior performance. Detailed analysis is also conducted to demonstrate the stability and generalizability of our method.

摘要

许多研究专注于训练高效PRM(过程奖励模型)的数据标注技术。然而现有方法在应用于长链思维推理(CoT)过程时存在显著缺陷:它们往往仅关注首个错误步骤及之前所有步骤,并假定后续步骤全部错误。这些方法忽视了长链CoT固有的自我修正与反思机制——即使在初始推理出现错误后,仍可能产生正确的推理步骤。为解决这一问题,我们提出一种专为长链CoT推理评分设计的PRM数据标注新方法。基于反思模式下正确与错误步骤常交替出现的特性,我们引入"错误传播"与"错误终止"概念,以增强PRM识别有效自我修正行为和基于错误步骤推理的能力。通过采用基于大语言模型的标注评判器,我们收集170万数据样本训练70亿参数PRM,并在解决方案和步骤两个层面进行评估。实验结果表明:相较于现有开源PRM及基于开源数据集训练的PRM,我们的模型在搜索引导、BoN和F1分数等多项指标上均表现更优。与广泛使用的基于蒙特卡洛(MC)的标注方法相比,我们的标注方案不仅具有更高数据效率,还能实现更优性能。我们还通过详细分析验证了方法的稳定性与泛化能力。


Causal Cartographer: From Mapping to Reasoning Over Counterfactual Worlds

Abstract

arXiv:2505.14396v1 Announce Type: new Abstract: Causal world models are systems that can answer counterfactual questions about an environment of interest, i.e. predict how it would have evolved if an arbitrary subset of events had been realized differently. It requires understanding the underlying causes behind chains of events and conducting causal inference for arbitrary unseen distributions. So far, this task eludes foundation models, notably large language models (LLMs), which do not have demonstrated causal reasoning capabilities beyond the memorization of existing causal relationships. Furthermore, evaluating counterfactuals in real-world applications is challenging since only the factual world is observed, limiting evaluation to synthetic datasets. We address these problems by explicitly extracting and modeling causal relationships and propose the Causal Cartographer framework. First, we introduce a graph retrieval-augmented generation agent tasked to retrieve causal relationships from data. This approach allows us to construct a large network of real-world causal relationships that can serve as a repository of causal knowledge and build real-world counterfactuals. In addition, we create a counterfactual reasoning agent constrained by causal relationships to perform reliable step-by-step causal inference. We show that our approach can extract causal knowledge and improve the robustness of LLMs for causal reasoning tasks while reducing inference costs and spurious correlations.

摘要

因果世界模型是一种能够回答关于目标环境反事实问题的系统,即预测当任意事件子集以不同方式实现时环境将如何演化。这需要理解事件链背后的潜在因果关系,并对任意未知分布进行因果推断。目前,基础模型(尤其是大型语言模型)尚未实现该能力,其因果推理仅限于对已知因果关系的记忆。此外,现实应用中反事实评估面临挑战,因仅能观测到事实世界,导致评估局限于合成数据集。我们通过显式提取和建模因果关系来解决这些问题,提出因果制图框架:首先设计图检索增强生成代理,负责从数据中检索因果关系。该方法能构建大规模现实世界因果关系网络,既可作为因果知识库,又能建立真实反事实场景。同时开发受因果关系约束的反事实推理代理,实现可靠的逐步因果推断。实验表明,该方法能有效提取因果知识,提升大型语言模型在因果推理任务中的鲁棒性,同时降低推理成本并减少伪相关性。


SkyMemory: A LEO Edge Cache for Transformer Inference Optimization and Scale Out

Abstract

arXiv:2505.14427v1 Announce Type: new Abstract: We expand the scope of cache memory to include LEO constellations, which are highly distributed systems with thousands of satellites connected with free-space optics inter-satellite links (ISL) always only one hop from any point on earth. We show how to increase the number of cache hits and improve the speed of inference for the important use case of LLMs. These benefits apply not only to LLMs, both terrestrially hosted and on satellites, but also generalize to any cache distributed over multiple locations that needs to be accessed in a timely manner. We show the benefit of our key value cache (KVC) protocol in simulations and present a proof-of-concept implementation of the protocol for KVCs on a testbed comprising 5 Intel NUC Linux mini PCs hosting a 19x5 constellation, with an NVIDIA Jetson Nano 8GB GPU hosting the LLM.

摘要

我们将缓存内存的应用范围扩展至低地球轨道(LEO)卫星星座——这种高度分布式系统由数千颗通过自由空间光通信星间链路(ISL)连接的卫星构成,与地球任意位置仅间隔单跳距离。针对大语言模型(LLM)这一重要应用场景,我们展示了如何提升缓存命中率并加速推理过程。这些优势不仅适用于地面部署和星载的LLM,还可推广至任何需要及时访问的分布式多节点缓存系统。通过仿真实验验证了我们提出的键值缓存(KVC)协议的性能优势,并在由5台Intel NUC Linux迷你主机(模拟19×5卫星星座)与搭载LLM的NVIDIA Jetson Nano 8GB GPU组成的测试平台上,实现了该协议的验证性部署。


PRL: Prompts from Reinforcement Learning

Abstract

arXiv:2505.14412v1 Announce Type: new Abstract: Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues, ones that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On the classification task, it surpasses prior methods by 2.58% over APE and 1.00% over EvoPrompt. Additionally, it improves the average ROUGE scores on the summarization task by 4.32 over APE and by 2.12 over EvoPrompt and the SARI score on simplification by 6.93 over APE and by 6.01 over EvoPrompt. Our code is available at https://github.com/Batorskq/prl .

摘要

有效提示工程仍然是充分发挥大语言模型(LLM)潜力的核心挑战。虽然精心设计的提示能显著提升模型表现,但其构建通常需要专家直觉和对任务的深刻理解。更重要的是,最具影响力的提示往往依赖于微妙的语义线索——这些线索可能超出人类感知范围,却对引导LLM行为至关重要。本文提出PRL(基于强化学习的提示生成),一种创新的自动提示生成方法。与现有方法不同,PRL能够生成训练过程中未见过的新型少样本示例。我们的方法在文本分类、文本简化和摘要生成等多项基准测试中均达到最先进水平:在分类任务上以2.58%的优势超越APE方法,1.00%超越EvoPrompt;在摘要任务上将ROUGE平均分较APE提升4.32分,较EvoPrompt提升2.12分;在简化任务中SARI分数较APE提高6.93分,较EvoPrompt提高6.01分。代码已开源:https://github.com/Batorskq/prl。


Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning

Abstract

arXiv:2505.14403v1 Announce Type: new Abstract: Recent advances in reasoning language models have witnessed a paradigm shift from short to long CoT pattern. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet primary existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.

摘要

推理语言模型的最新进展呈现出从短链思维(CoT)模式向长链模式的范式转变。鉴于长链CoT模型中推演过程的高计算成本,最大化固定训练数据集的效用变得至关重要。我们的分析表明,负面响应中包含自我反思和纠错步骤等有价值成分,但现有主流方法要么完全丢弃负样本(RFT),要么对所有标记施加均等惩罚(RL),未能有效利用这些潜在学习信号。为此,我们提出带有负样本增强的行为约束策略梯度法(BCPG-NSA),该细粒度离线强化学习框架包含三个阶段:1)样本分割,2)结合LLM与PRM评判器的基于共识的步骤正确性评估,3)设计NSA策略优化以有效挖掘负样本中的正向步骤。实验结果表明,在使用相同训练数据集的情况下,BCPG-NSA在多个高难度数学/编程推理基准测试中优于基线方法,实现了更高的样本效率,并在扩展到多轮迭代时展现出鲁棒性和可扩展性。


Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach

Abstract

arXiv:2505.14479v1 Announce Type: new Abstract: Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge. As a proof-of-concept, we focus on geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. We demonstrate that our method significantly improves proof accuracy for OpenAI's o1 model (58%-70% improvement); both analogous problems and the verifier's feedback contribute to these gains. More broadly, shifting to LLMs that generate provably correct conclusions could dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.

摘要

大语言模型(LLMs)在需要严格逻辑演绎和符号推理的形式化领域(如数学证明生成)中存在困难。我们提出了一种神经符号方法,通过结合LLMs的生成能力与结构化组件来克服这一挑战。作为概念验证,我们聚焦于几何问题。该方法包含两个关键部分:(1)检索类似问题并利用其证明指导LLM;(2)通过形式化验证器评估生成证明并提供反馈,帮助模型修正错误。实验表明,该方法显著提升了OpenAI o1模型的证明准确率(提升幅度达58%-70%),且类比问题和验证器反馈均对此增益有所贡献。更广泛而言,转向生成可验证正确结论的LLMs将极大提升其可靠性、精确性与一致性,从而解锁需要可信度的复杂任务和关键现实应用。


Reasoning Models Better Express Their Confidence

Abstract

arXiv:2505.14489v1 Announce Type: new Abstract: Despite their strengths, large language models (LLMs) often fail to communicate their confidence accurately, making it difficult to assess when they might be wrong and limiting their reliability. In this work, we demonstrate that reasoning models, i.e. LLMs that engage in extended chain-of-thought (CoT) reasoning, exhibit superior performance not only in problem-solving but also in accurately expressing their confidence. Specifically, we benchmark six reasoning models across six datasets and find that they achieve strictly better confidence calibration than their non-reasoning counterparts in 33 out of the 36 settings. Our detailed analysis reveals that these gains in calibration stem from the slow thinking behaviors of reasoning models, such as exploring alternative approaches and backtracking, which enable them to adjust their confidence dynamically throughout their CoT, making it progressively more accurate. In particular, we find that reasoning models become increasingly better calibrated as their CoT unfolds, a trend not observed in non-reasoning models. Moreover, removing slow thinking behaviors from the CoT leads to a significant drop in calibration. Lastly, we show that these gains are not exclusive to reasoning models: non-reasoning models also benefit when guided to perform slow thinking via in-context learning.

摘要

尽管大型语言模型(LLMs)具备强大能力,但它们往往无法准确传达其置信度,这使得评估其潜在错误变得困难并限制了其可靠性。本研究证明,推理模型(即进行扩展思维链(CoT)推理的LLMs)不仅在问题解决方面表现更优,还能更准确地表达其置信度。我们在六个数据集上对六种推理模型进行基准测试,发现其在36个实验场景中的33个场景下,均展现出严格优于非推理模型的置信度校准能力。深入分析表明,这些校准优势源于推理模型的慢思考行为(如探索替代方案和回溯机制),这些行为使其能够在思维链推理过程中动态调整置信度,从而持续提升准确性。特别值得注意的是,推理模型的校准效果会随着思维链的展开而不断增强,这一趋势在非推理模型中并未出现。此外,若从思维链中移除慢思考行为,校准性能会显著下降。最后,我们证实这些优势并非推理模型独有——通过上下文学习引导非推理模型执行慢思考后,其性能同样能获得提升。
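Confidence calibration, as discussed in this abstract, is often summarized with the expected calibration error (ECE). The abstract does not say which metric the authors report, so ECE is an assumption here, and the function name and bin count are illustrative. A minimal sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: stated confidences in [0, 1], one per answer.
    correct:     booleans, True where the answer was right.
    ECE is an assumed metric; the paper may report a different one.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that always states 100% confidence but is right only a quarter of the time gets an ECE of 0.75 under this definition, while a model whose stated confidence matches its accuracy in every bin gets 0.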


SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

Abstract

arXiv:2505.14615v1 Announce Type: new Abstract: We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a story context and conditions using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-assisted and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. SATBench exposes fundamental limitations in the search-based logical reasoning abilities of current LLMs and provides a scalable testbed for future research in logical reasoning.

摘要

我们推出SATBench,这是一个通过基于布尔可满足性(SAT)问题衍生的逻辑谜题来评估大语言模型(LLMs)逻辑推理能力的基准测试。与先前专注于基于推理规则(通常涉及从一组前提推导结论)的研究不同,我们的方法利用了SAT问题基于搜索的特性——其目标是找到满足特定逻辑约束的解。SATBench中的每个实例均从SAT公式生成,随后通过LLMs转化为故事背景和条件。该生成过程完全自动化,并允许通过调整子句数量来控制难度。所有2100个谜题均经过LLM辅助和求解器驱动的双重一致性验证,并对部分样本进行了人工校验。实验结果表明,即使性能最强的o4-mini模型在困难UNSAT问题上仅达到65.0%的正确率,接近50%的随机基线水平。SATBench揭示了当前LLMs在基于搜索的逻辑推理能力上的根本局限,为未来逻辑推理研究提供了可扩展的测试平台。
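To make the search-based nature of SAT concrete, here is a minimal brute-force satisfiability check. It is only an illustration of the underlying problem, not SATBench's generation or validation pipeline; the DIMACS-style integer encoding of literals and the function name `solve_cnf` are assumptions made for this sketch.

```python
from itertools import product


def solve_cnf(clauses, n_vars):
    """Return a satisfying assignment {var: bool}, or None if the formula is UNSAT.

    clauses: list of clauses; each clause is a list of nonzero ints,
             where k means variable k is true and -k means it is false.
    Tries all 2**n_vars assignments, so it only suits tiny formulas.
    """
    for bits in product([False, True], repeat=n_vars):
        # A formula holds when every clause has at least one true literal.
        if all(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return {i + 1: bits[i] for i in range(n_vars)}
    return None


satisfiable = solve_cnf([[1, 2], [-1, 2]], n_vars=2)    # (x1 or x2) and (not x1 or x2)
unsatisfiable = solve_cnf([[1], [-1]], n_vars=1)        # x1 and not x1
```

Showing that a formula is UNSAT requires ruling out every assignment, which is consistent with the abstract's observation that hard UNSAT puzzles are where models score closest to the 50% baseline.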


Let LLMs Break Free from Overthinking via Self-Braking Tuning

Abstract

arXiv:2505.14604v1 Announce Type: new Abstract: Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.

摘要

大型推理模型(LRMs),如OpenAI o1和DeepSeek-R1,通过生成长思维链显著提升了推理能力,在多种任务中展现出卓越性能。然而,这种性能提升伴随着生成过程中冗余推理的大幅增加,导致计算开销高昂并加剧了过度思考问题。尽管现有许多方法旨在解决过度思考问题,但它们往往依赖外部干预。本文提出了一种新颖框架——自制动调优(SBT),该框架通过允许模型自我调节推理过程来解决过度思考问题,从而消除对外部控制机制的依赖。我们基于标准答案构建了一套过度思考识别指标,并设计了一种系统性方法来检测冗余推理。该方法能准确识别推理轨迹中不必要的步骤,并为学习自我调节行为生成训练信号。在此基础上,我们开发了完整的自适应推理长度数据构建策略,并引入创新的制动提示机制,使模型能够自然学习在适当节点终止推理。在数学基准测试(AIME、AMC、MATH500、GSM8K)上的实验表明,我们的方法在保持与无约束模型相当准确度的同时,可降低高达60%的token消耗。


Cost-Augmented Monte Carlo Tree Search for LLM-Assisted Planning

Abstract

arXiv:2505.14656v1 Announce Type: new Abstract: While LLMs excel at open-ended reasoning, they often struggle with cost-sensitive planning, either treating all actions as having equal cost or failing to stay within strict budgets. In this paper, we introduce Cost-Augmented Monte Carlo Tree Search (CATS), a novel approach that brings explicit cost-awareness into LLM-guided planning. Tight cost constraints push the planner to quickly identify infeasible solutions, while looser constraints encourage optimization for minimal cost. We benchmark top LLMs such as GPT-4.1, Claude-3.7-Sonnet, and DeepSeek-R1, against our CATS planner to evaluate their performance in cost-sensitive scenarios. Our experiments suggest that raw LLMs such as GPT-4.1 often falter under tight budgets, whereas CATS consistently delivers strong performance, achieving higher task success rates and better cost efficiency. CATS provides an effective solution for budget-aware decision-making by combining the reasoning power of LLMs with structured search.

摘要

尽管大语言模型(LLMs)在开放式推理方面表现出色,但它们往往难以应对成本敏感的规划任务,要么将所有行动视为成本相同,要么无法严格遵守预算限制。本文提出成本增强蒙特卡洛树搜索(CATS),这一新方法将显式成本意识引入LLM引导的规划中。严格的成本约束促使规划器快速识别不可行方案,而宽松约束则鼓励以最小成本进行优化。我们针对GPT-4.1、Claude-3.7-Sonnet和DeepSeek-R1等顶尖LLMs与CATS规划器进行基准测试,评估其在成本敏感场景下的表现。实验表明,原始LLMs如GPT-4.1在严格预算下常常失效,而CATS始终保持优异性能,实现更高的任务成功率和更佳的成本效益。通过将LLMs的推理能力与结构化搜索相结合,CATS为预算感知决策提供了有效解决方案。


Debating for Better Reasoning: An Unsupervised Multimodal Approach

Abstract

arXiv:2505.14627v1 Announce Type: new Abstract: As Large Language Models (LLMs) gain expertise across diverse domains and modalities, scalable oversight becomes increasingly challenging, particularly when their capabilities may surpass human evaluators. Debate has emerged as a promising mechanism for enabling such oversight. In this work, we extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement. Experiments on several multimodal tasks demonstrate that the debate framework consistently outperforms individual expert models. Moreover, judgments from weaker LLMs can help instill reasoning capabilities in vision-language models through finetuning.

摘要

随着大型语言模型(LLMs)在跨领域多模态任务中展现出专业能力,可扩展监督机制变得日益困难——尤其当其能力可能超越人类评估者时。辩论机制已成为实现此类监督的重要途径。本研究将辩论范式扩展至多模态环境,探索弱模型监督并增强强模型性能的潜力。我们聚焦视觉问答(VQA)任务,让两个'有视觉能力'的专家级视觉语言模型就答案展开辩论,而'无视觉能力'(仅文本)的裁判则仅依据论证质量进行裁决。本框架中,专家仅捍卫与其信念一致的答案,从而避免显式角色扮演,并将辩论集中于专家存在分歧的实例。多项多模态任务实验表明,辩论框架持续优于单个专家模型。此外,通过微调,较弱LLM的评判有助于为视觉语言模型注入推理能力。


SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

Abstract

arXiv:2505.14667v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.

摘要

大型推理模型(LRMs)已成为解决复杂问题的强大工具,但其结构化推理路径在接触有害提示时可能产生不安全输出。现有安全对齐方法虽能减少有害输出,却会削弱推理深度,导致复杂多步任务中的显著性能折衷,且仍难以抵御复杂越狱攻击。为此,我们提出SAFEPATH——一种轻量级对齐方法,通过微调使LRMs在检测到有害提示时,于推理起始处生成仅8个token的安全引导词,同时保持后续推理过程不受监督。多基准测试表明,SAFEPATH在维持推理性能的同时有效减少有害输出:在DeepSeek-R1-Distill-Llama-8B模型中,该方法将有害响应降低达90.0%,并拦截83.3%的越狱尝试,其计算消耗仅为直接拒绝法的1/295.9,为SafeChain的1/314.1。我们还提出无需微调的零样本变体。此外,通过系统分析现有大语言模型方法在以推理为核心的模型中的泛化表现(及其失效案例),揭示了构建更安全AI的关键缺口与新研究方向。


Guarded Query Routing for Large Language Models

Abstract

arXiv:2505.14524v1 Announce Type: new Abstract: Query routing, the task of routing user queries to different large language model (LLM) endpoints, can be considered as a text classification problem. However, out-of-distribution queries must be handled properly, as those could be questions about unrelated domains, queries in other languages, or even contain unsafe text. Here, we thus study a guarded query routing problem, for which we first introduce the Guarded Query Routing Benchmark (GQR-Bench), which covers three exemplary target domains (law, finance, and healthcare), and seven datasets to test robustness against out-of-distribution queries. We then use GQR-Bench to contrast the effectiveness and efficiency of LLM-based routing mechanisms (GPT-4o-mini, Llama-3.2-3B, and Llama-3.1-8B), standard LLM-based guardrail approaches (LlamaGuard and NVIDIA NeMo Guardrails), continuous bag-of-words classifiers (WideMLP, fastText), and traditional machine learning models (SVM, XGBoost). Our results show that WideMLP, enhanced with out-of-domain detection capabilities, yields the best trade-off between accuracy (88%) and speed (<4ms). The embedding-based fastText excels at speed (<1ms) with acceptable accuracy (80%), whereas LLMs yield the highest accuracy (91%) but are comparatively slow (62ms for local Llama-3.1:8B and 669ms for remote GPT-4o-mini calls). Our findings challenge the automatic reliance on LLMs for (guarded) query routing and provide concrete recommendations for practical applications. GQR-Bench will be released as a Python package named gqr.

摘要

查询路由作为将用户查询分配至不同大语言模型(LLM)终端的任务,可视为文本分类问题。然而必须妥善处理分布外查询,这类查询可能涉及无关领域的问题、其他语言的查询甚至包含不安全文本。为此,我们研究了一种带防护机制的查询路由问题,首先构建了防护式查询路由基准测试(GQR-Bench),涵盖法律、金融和医疗三个典型目标领域及七个用于测试分布外查询鲁棒性的数据集。随后利用GQR-Bench对比了基于LLM的路由机制(GPT-4o-mini、Llama-3.2-3B和Llama-3.1-8B)、标准LLM防护方法(LlamaGuard与NVIDIA NeMo Guardrails)、连续词袋分类器(WideMLP、fastText)以及传统机器学习模型(SVM、XGBoost)的效能与效率。实验表明:增强域外检测能力的WideMLP在准确率(88%)与速度(<4ms)间达到最佳平衡;基于嵌入的fastText以可接受准确率(80%)实现最快速度(<1ms);而LLM虽获得最高准确率(91%),但速度较慢(本地Llama-3.1:8B需62ms,远程GPT-4o-mini调用需669ms)。本研究质疑了自动依赖LLM进行(防护式)查询路由的实践,并为实际应用提供了具体建议。GQR-Bench将以名为gqr的Python包形式发布。
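The guarded routing problem can be pictured as classification with a reject option: route to a domain only when the evidence is strong enough, otherwise flag the query as out-of-distribution. The sketch below uses toy keyword matching as a stand-in for the classifiers benchmarked in the paper. The keyword lists, the `route` function, and the `min_score` threshold are all hypothetical, and none of this reflects the API of the gqr package.

```python
import re

# Hypothetical keyword sets for the three target domains named in the abstract.
DOMAIN_KEYWORDS = {
    "law": {"contract", "lawsuit", "liability", "court"},
    "finance": {"loan", "stock", "dividend", "mortgage"},
    "healthcare": {"symptom", "dosage", "diagnosis", "vaccine"},
}


def route(query, min_score=1):
    """Return the best-matching domain, or 'out_of_distribution' as the guard.

    The score is the number of domain keywords present in the query;
    a real router would use a trained classifier and a calibrated threshold.
    """
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    scores = {domain: len(tokens & words) for domain, words in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "out_of_distribution"


in_domain = route("Can I sue over a breached contract in court?")
off_topic = route("What is the best pizza topping?")
```

The design point is the explicit fallback label: without it, every query is forced into one of the three domains, including unrelated or unsafe ones.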


Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training

Abstract

arXiv:2505.14681v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures within Large Reasoning Models (LRMs) have achieved impressive reasoning capabilities by selectively activating experts to facilitate structured cognitive processes. Despite notable advances, existing reasoning models often suffer from cognitive inefficiencies like overthinking and underthinking. To address these limitations, we introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE), designed to improve reasoning performance without additional training or complex heuristics. Leveraging normalized Pointwise Mutual Information (nPMI), we systematically identify specialized experts, termed "cognitive experts", that orchestrate meta-level reasoning operations characterized by tokens like "<think>". Empirical evaluations with leading MoE-based LRMs (DeepSeek-R1 and Qwen3-235B) on rigorous quantitative and scientific reasoning benchmarks demonstrate noticeable and consistent improvements in reasoning accuracy, cognitive efficiency, and cross-domain generalization. Crucially, our lightweight approach substantially outperforms prevalent reasoning-steering techniques, such as prompt design and decoding constraints, while preserving the model's general instruction-following skills. These results highlight reinforcing cognitive experts as a promising, practical, and interpretable direction to enhance cognitive efficiency within advanced reasoning models.

摘要

大型推理模型(LRM)中的混合专家(MoE)架构通过选择性激活专家以促进结构化认知过程,已展现出卓越的推理能力。尽管取得显著进展,现有推理模型仍常受过度思考与思考不足等认知低效问题困扰。为此,我们提出一种名为"强化认知专家"(RICE)的新型推理时引导方法,旨在无需额外训练或复杂启发式规则的情况下提升推理性能。该方法利用归一化点互信息(nPMI)系统化识别特定专家——即协调以"<think>"等标记为特征的元级推理操作的"认知专家"。基于主流MoE架构LRM(DeepSeek-R1与Qwen3-235B)在严格定量与科学推理基准上的实证评估表明,该方法在推理准确率、认知效率及跨领域泛化能力方面均取得显著且一致的提升。值得注意的是,这种轻量级方法显著优于提示工程和解码约束等主流推理引导技术,同时保持模型的通用指令遵循能力。这些成果表明,强化认知专家是提升高级推理模型认知效率的一条具有前景、实用性强且可解释的研究路径。
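Normalized pointwise mutual information has a standard definition: nPMI(x, y) = log(p(x, y) / (p(x) p(y))) / (-log p(x, y)), which ranges from -1 (never co-occur) through 0 (independent) to 1 (always co-occur). The sketch below computes it from co-occurrence counts. How RICE actually counts expert activations against reasoning tokens is not specified in the abstract, so reading x as "expert is activated" and y as "token is a reasoning marker such as <think>" is an assumption, as are the function and parameter names.

```python
import math


def npmi(n_xy, n_x, n_y, n_total):
    """Normalized PMI from counts.

    n_xy:    positions where both events occur (assumed: expert active AND marker token)
    n_x:     positions where event x occurs (assumed: expert active)
    n_y:     positions where event y occurs (assumed: marker token)
    n_total: total positions observed
    """
    if n_xy == 0:
        return -1.0  # limit value when the events never co-occur
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    if p_xy == 1.0:
        # Degenerate case: both events occur everywhere, so nPMI is 0/0.
        # Returning 0.0 is a convention chosen for this sketch.
        return 0.0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


always_together = npmi(n_xy=10, n_x=10, n_y=10, n_total=100)
independent = npmi(n_xy=25, n_x=50, n_y=50, n_total=100)
```

Under this reading, experts would be ranked by their nPMI with the marker tokens and the top-scoring ones treated as candidate "cognitive experts".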


ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions

Abstract

arXiv:2505.14668v1 Announce Type: new Abstract: Recent advances in Large Language Models (LLMs) have propelled intelligent agents from reactive responses to proactive support. While promising, existing proactive agents either rely exclusively on observations from enclosed environments (e.g., desktop UIs) with direct LLM inference or employ rule-based proactive notifications, leading to suboptimal user intent understanding and limited functionality for proactive service. In this paper, we introduce ContextAgent, the first context-aware proactive agent that incorporates extensive sensory contexts to enhance the proactive capabilities of LLM agents. ContextAgent first extracts multi-dimensional contexts from massive sensory perceptions on wearables (e.g., video and audio) to understand user intentions. ContextAgent then leverages the sensory contexts and the persona contexts from historical data to predict the necessity for proactive services. When proactive assistance is needed, ContextAgent further automatically calls the necessary tools to assist users unobtrusively. To evaluate this new task, we curate ContextAgentBench, the first benchmark for evaluating context-aware proactive LLM agents, covering 1,000 samples across nine daily scenarios and twenty tools. Experiments on ContextAgentBench show that ContextAgent outperforms baselines by achieving up to 8.5% and 6.0% higher accuracy in proactive predictions and tool calling, respectively. We hope our research can inspire the development of more advanced, human-centric, proactive AI assistants.

摘要

大型语言模型(LLM)的最新进展将智能代理从被动响应推向主动支持。尽管前景广阔,现有主动代理要么仅依赖封闭环境(如桌面用户界面)的观察结果进行直接LLM推理,要么采用基于规则的主动通知,导致用户意图理解欠佳且主动服务功能有限。本文提出ContextAgent,首个结合大量感知情境来增强LLM代理主动能力的上下文感知主动代理。ContextAgent首先从可穿戴设备的海量感知数据(如视频和音频)中提取多维上下文以理解用户意图,随后利用这些感知上下文及历史数据中的人物角色上下文来预测主动服务的必要性。当需要主动协助时,ContextAgent会进一步自动调用必要工具以无干扰方式辅助用户。为评估这一新任务,我们构建了首个上下文感知主动LLM代理评测基准ContextAgentBench,涵盖九类日常场景和二十种工具的1,000个样本。在ContextAgentBench上的实验表明,ContextAgent在主动预测和工具调用的准确率上分别比基线方法最高提升8.5%和6.0%。我们希望这项研究能推动更先进、以人为中心的主动式AI助手的发展。


Abacus: A Cost-Based Optimizer for Semantic Operator Systems

Abstract

arXiv:2505.14661v1 Announce Type: new Abstract: LLMs enable an exciting new class of data processing applications over large collections of unstructured documents. Several new programming frameworks have enabled developers to build these applications by composing them out of semantic operators: a declarative set of AI-powered data transformations with natural language specifications. These include LLM-powered maps, filters, joins, etc. used for document processing tasks such as information extraction, summarization, and more. While systems of semantic operators have achieved strong performance on benchmarks, they can be difficult to optimize. An optimizer for this setting must determine how to physically implement each semantic operator in a way that optimizes the system globally. Existing optimizers are limited in the number of optimizations they can apply, and most (if not all) cannot optimize system quality, cost, or latency subject to constraint(s) on the other dimensions. In this paper we present Abacus, an extensible, cost-based optimizer which searches for the best implementation of a semantic operator system given a (possibly constrained) optimization objective. Abacus estimates operator performance by leveraging a minimal set of validation examples and, if available, prior beliefs about operator performance. We evaluate Abacus on document processing workloads in the biomedical and legal domains (BioDEX; CUAD) and multi-modal question answering (MMQA). We demonstrate that systems optimized by Abacus achieve 18.7%-39.2% better quality and up to 23.6x lower cost and 4.2x lower latency than the next best system.

摘要

大型语言模型(LLMs)为海量非结构化文档的数据处理应用开辟了令人振奋的新范式。多个新型编程框架使开发者能够通过组合语义运算符来构建此类应用——这是一组具有自然语言规范的声明式AI驱动数据转换工具,包括用于信息抽取、摘要等文档处理任务的LLM驱动的映射、过滤、连接等操作。尽管语义运算符系统在基准测试中表现出色,但其优化仍面临挑战。该场景下的优化器必须确定如何以全局最优方式物理实现每个语义运算符。现有优化器可应用的优化手段有限,且大多数(若非全部)无法在质量、成本或延迟等维度存在约束条件下实现多目标优化。本文提出Abacus:一个可扩展的基于代价的优化器,能够在给定(可能带约束的)优化目标下搜索语义运算符系统的最佳实现方案。Abacus通过利用最小验证样例集及(可选的)先验性能认知来评估算子性能。我们在生物医学与法律领域(BioDEX;CUAD)的文档处理任务以及多模态问答(MMQA)上评估Abacus,实验表明经Abacus优化的系统相较次优方案可实现18.7%-39.2%的质量提升,并降低最高23.6倍的成本和4.2倍的延迟。
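To make the idea of a constrained optimization objective concrete, here is a toy sketch: among candidate physical implementations, pick the one with the highest estimated quality whose estimated cost and latency stay within the given limits. The `Plan` type, the candidate names, and all numbers are hypothetical; the abstract does not describe how Abacus actually searches the plan space or estimates operator performance.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    name: str
    quality: float  # estimated quality, higher is better
    cost: float     # estimated dollars per run
    latency: float  # estimated seconds per run


def best_plan(plans: Sequence[Plan],
              max_cost: float = float("inf"),
              max_latency: float = float("inf")) -> Optional[Plan]:
    """Return the highest-quality plan that satisfies both constraints."""
    feasible = [p for p in plans
                if p.cost <= max_cost and p.latency <= max_latency]
    if not feasible:
        return None  # no candidate meets the constraints
    return max(feasible, key=lambda p: p.quality)


# Hypothetical candidates for a single semantic map operator.
candidates = [
    Plan("large-model-map", quality=0.90, cost=5.0, latency=40.0),
    Plan("small-model-map", quality=0.75, cost=0.5, latency=10.0),
    Plan("ensemble-map", quality=0.93, cost=12.0, latency=90.0),
]
```

With no limits the ensemble wins on quality; tightening `max_cost` to 6.0 switches the choice to the large-model plan, which is the kind of trade-off a cost-based optimizer has to resolve across a whole operator pipeline.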


SLOT: Sample-specific Language Model Optimization at Test-time

Abstract

arXiv:2505.12392v1 Announce Type: cross Abstract: We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model's ability to respond accurately to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, leading to poor performance on those not well represented among general samples. To address this, SLOT conducts a few optimization steps at test time to update a lightweight sample-specific parameter vector. The vector is added to the final hidden layer before the output head, and enables efficient adaptation by caching the last-layer features during per-sample optimization. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better align with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT achieves an accuracy gain of 8.6% on GSM8K from 57.54% to 66.19%, while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at https://github.com/maple-research-lab/SLOT.

摘要

我们提出SLOT(测试时样本特异性语言模型优化),这是一种新颖且参数高效的测试时推理方法,旨在增强语言模型对单个提示的精准响应能力。现有大型语言模型(LLMs)在处理复杂指令时往往表现不佳,导致其在通用样本中代表性不足的指令上性能较差。为解决这一问题,SLOT在测试时执行少量优化步骤,更新轻量级的样本特异性参数向量。该向量被添加至输出层前的最终隐藏层,并通过在单样本优化期间缓存最后一层特征实现高效适配。仅通过最小化输入提示的交叉熵损失,SLOT使模型能更好地对齐并遵循每个给定指令。实验表明,我们的方法在多个基准测试和不同LLMs上均优于对比模型。例如,搭载SLOT的Qwen2.5-7B在GSM8K上的准确率从57.54%提升至66.19%,增益达8.6%;而配备SLOT的DeepSeek-R1-Distill-Llama-70B则在GPQA上以68.69%的准确率创下70B级模型的最优性能。代码已开源:https://github.com/maple-research-lab/SLOT。
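The mechanism described above can be sketched in a few lines: cache the final hidden states of the prompt, add one shared vector `delta` to them, and take a few gradient steps on the prompt's own next-token cross-entropy while the output head stays frozen. This is a toy illustration in pure Python, not the authors' implementation (see their repository for that). The plain gradient-descent update, the learning rate, and the random toy tensors are assumptions.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def prompt_loss_and_grad(features, targets, head, delta):
    """Mean next-token cross-entropy and its gradient with respect to delta.

    features: cached final hidden states, one d-vector per prompt position
    targets:  next-token id for each position
    head:     frozen output head, a V x d matrix (list of rows)
    delta:    sample-specific d-vector added to every hidden state
    """
    d = len(delta)
    grad = [0.0] * d
    loss = 0.0
    for h, y in zip(features, targets):
        shifted = [h[j] + delta[j] for j in range(d)]
        logits = [sum(row[j] * shifted[j] for j in range(d)) for row in head]
        probs = softmax(logits)
        loss -= math.log(probs[y])
        # d(loss)/d(delta) = head^T (probs - one_hot(y))
        for v, row in enumerate(head):
            err = probs[v] - (1.0 if v == y else 0.0)
            for j in range(d):
                grad[j] += err * row[j]
    n = len(features)
    return loss / n, [g / n for g in grad]


def slot_adapt(features, targets, head, steps=5, lr=0.05):
    """Run a few gradient steps on delta only; the head is never updated."""
    delta = [0.0] * len(features[0])
    for _ in range(steps):
        _, grad = prompt_loss_and_grad(features, targets, head, delta)
        delta = [dj - lr * gj for dj, gj in zip(delta, grad)]
    return delta


# Hypothetical toy setup: 6 prompt positions, hidden size 4, vocabulary of 5.
rng = random.Random(0)
features = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
head = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
targets = [rng.randrange(5) for _ in range(6)]

loss_before, _ = prompt_loss_and_grad(features, targets, head, [0.0] * 4)
delta = slot_adapt(features, targets, head)
loss_after, _ = prompt_loss_and_grad(features, targets, head, delta)
```

Because the features are cached, each step costs only one pass through the output head rather than a full forward pass through the model, which is where the efficiency claimed in the abstract would come from.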


ProdRev: A DNN framework for empowering customers using generative pre-trained transformers

Abstract

arXiv:2505.13491v1 Announce Type: cross Abstract: Following the pandemic, customers' preference for e-commerce has accelerated. Because a single product can have many reviews (sometimes running into the thousands), the volume of information can create decision paralysis for the buyer. This disempowers the consumer, who cannot be expected to go over so many reviews, since doing so is time-consuming and can be confusing. Various commercial tools are available that use a scoring mechanism to arrive at an adjusted score, which can alert the user to potential review manipulation. This paper proposes a framework that fine-tunes a generative pre-trained transformer to understand these reviews better and to use "common sense" to make better decisions. These models have more than 13 billion parameters. To fine-tune the model for our requirement, we use the curie engine from generative pre-trained transformer (GPT3). By using generative models, we introduce abstractive summarization instead of a simple extractive method of summarizing the reviews. This brings out the true relationship between the reviews rather than simply copying and pasting. This introduces an element of "common sense" for the user and helps them quickly make the right decisions. The user is provided with the pros and cons from the processed reviews, so the user/customer can make their own decisions.

摘要

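疫情之后,消费者对电子商务的偏好加速增长。由于单个商品往往有大量评论(有时多达数千条),过多的信息可能使买家陷入决策瘫痪。这让消费者处于弱势:逐条阅读如此多的评论既耗时又容易令人困惑。现有多种商业工具通过评分机制得出调整后的分数,可提醒用户注意潜在的评论操纵。本文提出一种基于生成式预训练变换器的微调框架,以更深入理解评论内容,并运用"常识"辅助决策。该模型参数量超130亿,为满足需求我们采用GPT3中的Curie引擎进行微调。通过生成式模型实现抽象式摘要,而非简单的抽取式摘要方法,从而揭示评论间的真实关联而非机械复制。这为用户引入"常识"要素,助其快速做出正确决策。系统将处理后的评论优缺点呈现给用户,使其能自主做出决策。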


Pel, A Programming Language for Orchestrating AI Agents

Abstract

arXiv:2505.13453v1 Announce Type: cross Abstract: The proliferation of Large Language Models (LLMs) has opened new frontiers in computing, yet controlling and orchestrating their capabilities beyond simple text generation remains a challenge. Current methods, such as function/tool calling and direct code generation, suffer from limitations in expressiveness, scalability, cost, security, and the ability to enforce fine-grained control. This paper introduces Pel, a novel programming language specifically designed to bridge this gap. Inspired by the strengths of Lisp, Elixir, Gleam, and Haskell, Pel provides a syntactically simple, homoiconic, and semantically rich platform for LLMs to express complex actions, control flow, and inter-agent communication safely and efficiently. Pel's design emphasizes a minimal, easily modifiable grammar suitable for constrained LLM generation, eliminating the need for complex sandboxing by enabling capability control at the syntax level. Key features include a powerful piping mechanism for linear composition, first-class closures enabling easy partial application and functional patterns, built-in support for natural language conditions evaluated by LLMs, and an advanced Read-Eval-Print-Loop (REPeL) with Common Lisp-style restarts and LLM-powered helper agents for automated error correction. Furthermore, Pel incorporates automatic parallelization of independent operations via static dependency analysis, crucial for performant agentic systems. We argue that Pel offers a more robust, secure, and expressive paradigm for LLM orchestration, paving the way for more sophisticated and reliable AI agentic frameworks.

摘要

大型语言模型(LLMs)的激增为计算领域开辟了新前沿,但如何超越简单文本生成来控制和协调其能力仍是一个挑战。现有方法(如函数/工具调用和直接代码生成)在表达能力、可扩展性、成本、安全性及实施细粒度控制方面存在局限。本文提出Pel——一种专为弥合这一差距而设计的新型编程语言。受Lisp、Elixir、Gleam和Haskell的启发,Pel提供了语法简洁、同像性且语义丰富的平台,使LLMs能够安全高效地表达复杂操作、控制流和智能体间通信。Pel的设计强调最小化且易于修改的语法,适用于受限的LLM生成环境,通过在语法层面实现能力控制,消除了复杂沙箱的需求。其关键特性包括:用于线性组合的强大管道机制、支持简单偏应用和函数模式的一等闭包、LLM评估的自然语言条件内置支持,以及具备Common Lisp风格重启功能和LLM驱动辅助代理的先进REPeL交互环境(用于自动纠错)。此外,Pel通过静态依赖分析实现独立操作的自动并行化,这对高性能智能体系统至关重要。我们认为,Pel为LLM协调提供了更健壮、安全且富有表现力的范式,为构建更复杂可靠的AI智能体框架奠定了基础。


LODGE: Joint Hierarchical Task Planning and Learning of Domain Models with Grounded Execution

Abstract

arXiv:2505.13497v1 Announce Type: cross Abstract: Large Language Models (LLMs) enable planning from natural language instructions using implicit world knowledge, but often produce flawed plans that require refinement. Instead of directly predicting plans, recent methods aim to learn a problem domain that can be solved for different goal states using classical planners. However, these approaches require significant human feedback to obtain useful models. We address this shortcoming by learning hierarchical domains, where low-level predicates and actions are composed into higher-level counterparts, and by leveraging simulation to validate their preconditions and effects. This hierarchical approach is particularly powerful for long-horizon planning, where LLM-based planning approaches typically struggle. Furthermore, we introduce a central error reasoner to ensure consistency among the different planning levels. Evaluation on two challenging International Planning Competition (IPC) domains and a long-horizon robot manipulation task demonstrates higher planning success rates than state-of-the-art domain synthesis and LLM-modulo planning methods, while constructing high-quality models of the domain. Resources, videos and detailed experiment results are available at https://claudius-kienle.github.io/lodge/.

摘要

大语言模型(LLMs)能够利用隐含的世界知识根据自然语言指令进行规划,但生成的计划常存在缺陷需进一步优化。近期研究方法不再直接预测计划,而是通过学习可针对不同目标状态通过经典规划器求解的问题域。然而,这些方法需要大量人工反馈才能获得有效模型。我们通过以下方式改进这一不足:学习分层域结构(将底层谓词与动作组合为高层对应项),并利用仿真验证其前提条件与效果。这种分层方法尤其适用于长时程规划任务——基于LLM的规划方法通常在此类任务中表现欠佳。此外,我们引入核心错误推理器以确保不同规划层级间的一致性。在国际规划竞赛(IPC)两个高难度领域及长时程机器人操作任务上的评估表明,相较于最先进的域合成方法与LLM-modulo规划方法,本方法实现了更高的规划成功率,同时构建了高质量的领域模型。相关资源、视频及详细实验结果见https://claudius-kienle.github.io/lodge/。


Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale

Abstract

arXiv:2505.13480v1 Announce Type: cross Abstract: Suicide prevention remains a critical public health challenge. While online platforms such as Reddit's r/SuicideWatch have historically provided spaces for individuals to express suicidal thoughts and seek community support, the advent of large language models (LLMs) introduces a new paradigm in which individuals may begin disclosing ideation to AI systems instead of humans. This study evaluates the capability of LLMs to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). We assess the zero-shot performance of six models, including Claude, GPT, Mistral, and LLaMA, in classifying posts across a 7-point severity scale (Levels 0-6). Results indicate that Claude and GPT closely align with human annotations, while Mistral achieves the lowest ordinal prediction error. Most models exhibit ordinal sensitivity, with misclassifications typically occurring between adjacent severity levels. We further analyze confusion patterns, misclassification sources, and ethical considerations, underscoring the importance of human oversight, transparency, and cautious deployment. Full code and supplementary materials are available at https://github.com/av9ash/llm_cssrs_code.

摘要

自杀预防仍是公共卫生领域的一项关键挑战。尽管Reddit的r/SuicideWatch等在线平台历来为个体提供了表达自杀意念和寻求社群支持的空间,但大语言模型(LLMs)的出现带来了新范式——个体可能开始向AI系统而非人类披露自杀倾向。本研究评估了LLMs采用哥伦比亚自杀严重程度评定量表(C-SSRS)进行自动化自杀风险评估的能力。我们测试了六种模型(包括Claude、GPT、Mistral和LLaMA)在7级严重程度量表(0-6级)上对帖文进行零样本分类的表现。结果表明:Claude和GPT与人工标注结果高度吻合,而Mistral的序数预测误差最低。多数模型表现出序数敏感性,误判通常发生在相邻严重等级之间。我们进一步分析了混淆模式、误判来源及伦理问题,强调人类监督、透明度和谨慎部署的重要性。完整代码及补充材料见https://github.com/av9ash/llm_cssrs_code。
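To make "ordinal prediction error" and "adjacent severity levels" concrete, the sketch below scores predicted levels against gold labels on the 0-6 scale. The abstract does not name the exact metric used in the study, so mean absolute error is an assumed stand-in, and the labels are made-up examples rather than data from the paper.

```python
from typing import Dict, Sequence


def ordinal_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Exact-match accuracy, mean absolute error, and share of off-by-one errors."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    diffs = [abs(g - p) for g, p in zip(gold, pred)]
    n = len(diffs)
    return {
        "accuracy": sum(d == 0 for d in diffs) / n,
        "mean_absolute_error": sum(diffs) / n,
        # Errors that land on a neighbouring severity level.
        "adjacent_error_rate": sum(d == 1 for d in diffs) / n,
    }


# Made-up severity levels (0-6) for five posts.
gold = [0, 2, 3, 6, 4]
pred = [0, 3, 3, 5, 1]
metrics = ordinal_metrics(gold, pred)
```

A model can have modest exact-match accuracy and still a low mean absolute error if most of its mistakes are off by one level, which is the pattern the abstract calls ordinal sensitivity.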


Noise Injection Systemically Degrades Large Language Model Safety Guardrails

Abstract

arXiv:2505.13500v1 Announce Type: cross Abstract: Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet, their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) deeper safety fine-tuning affords no extra protection, and (3) chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight the potential of reasoning-based and reinforcement learning approaches as a promising direction for developing more robust AI safety systems. These results have important implications for real-world deployment of LLMs in safety-critical applications, as they imply that widely deployed safety tuning methods can fail even without adversarial prompts.

摘要

大语言模型(LLMs)中的安全护栏是防止有害输出的关键组件。然而,人们对其在扰动下的韧性仍缺乏了解。本文通过系统地向模型激活中注入高斯噪声,研究了LLMs安全微调的鲁棒性。我们在多个开放权重模型上发现:(1)高斯噪声可使有害输出率最高提升27%(p < 0.001);(2)更深层次的安全微调无法提供额外保护;(3)思维链推理基本保持完整。这些发现揭示了当前安全对齐技术的关键脆弱性,并指出基于推理和强化学习的方法有望成为开发更鲁棒AI安全系统的方向。研究结果对LLMs在安全关键场景的实际部署具有重要意义:广泛采用的安全调优方法即使在没有对抗性提示的情况下也可能失效。
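The perturbation itself is simple to state: add independent zero-mean Gaussian noise to each activation value. The sketch below shows only that operation on a plain list of floats. Which layers the authors perturb, the noise scales they use, and how the noise is wired into a real model are not given in the abstract, so `sigma` and the sample vector are hypothetical.

```python
import random
from typing import List, Sequence


def add_gaussian_noise(activations: Sequence[float], sigma: float,
                       rng: random.Random) -> List[float]:
    """Return a copy of the activations with i.i.d. N(0, sigma^2) noise added."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    return [a + rng.gauss(0.0, sigma) for a in activations]


# Hypothetical activation vector; a fixed seed keeps the run reproducible.
hidden = [0.5, -1.2, 0.0, 2.3]
noisy = add_gaussian_noise(hidden, sigma=0.1, rng=random.Random(42))
```

Sweeping `sigma` upward from 0 and re-measuring the harmful-output rate at each setting is one plausible way to run the kind of systematic study the abstract describes.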


LLM Context Conditioning and PWP Prompting for Multimodal Validation of Chemical Formulas

Abstract

arXiv:2505.12257v1 Announce Type: cross Abstract: Identifying subtle technical errors within complex scientific and technical documents, especially those requiring multimodal interpretation (e.g., formulas in images), presents a significant hurdle for Large Language Models (LLMs) whose inherent error-correction tendencies can mask inaccuracies. This exploratory proof-of-concept (PoC) study investigates structured LLM context conditioning, informed by Persistent Workflow Prompting (PWP) principles, as a methodological strategy to modulate this LLM behavior at inference time. The approach is designed to enhance the reliability of readily available, general-purpose LLMs (specifically Gemini 2.5 Pro and ChatGPT Plus o3) for precise validation tasks, crucially relying only on their standard chat interfaces without API access or model modifications. To explore this methodology, we focused on validating chemical formulas within a single, complex test paper with known textual and image-based errors. Several prompting strategies were evaluated: while basic prompts proved unreliable, an approach adapting PWP structures to rigorously condition the LLM's analytical mindset appeared to improve textual error identification with both models. Notably, this method also guided Gemini 2.5 Pro to repeatedly identify a subtle image-based formula error previously overlooked during manual review, a task where ChatGPT Plus o3 failed in our tests. These preliminary findings highlight specific LLM operational modes that impede detail-oriented validation and suggest that PWP-informed context conditioning offers a promising and highly accessible technique for developing more robust LLM-driven analytical workflows, particularly for tasks requiring meticulous error detection in scientific and technical documents. Extensive validation beyond this limited PoC is necessary to ascertain broader applicability.

摘要

识别复杂科学与技术文档中的细微技术错误(尤其是需要多模态解读的内容,如图像中的公式),对于具有固有纠错倾向的大型语言模型(LLM)而言存在显著障碍,这种倾向可能掩盖不准确性。本探索性概念验证(PoC)研究基于持久工作流提示(PWP)原则,通过结构化LLM上下文调节方法,在推理阶段调控LLM行为。该方法旨在提升通用型LLM(具体为Gemini 2.5 Pro和ChatGPT Plus o3)在精确验证任务中的可靠性,关键仅依赖标准聊天界面而无需API访问或模型修改。为验证方法有效性,我们聚焦于一份已知包含文本与图像错误的复杂测试论文中的化学公式验证。评估多种提示策略发现:基础提示可靠性不足,而采用PWP结构严格调节LLM分析思维的方法可提升两款模型的文本错误识别能力。值得注意的是,该方法还引导Gemini 2.5 Pro多次识别出人工审查时遗漏的图像公式细微错误,而ChatGPT Plus o3在此任务中失败。这些初步发现揭示了阻碍细节导向验证的特定LLM操作模式,表明基于PWP的上下文调节为开发更稳健的LLM驱动分析工作流(尤其针对科技文档中需要精细错误检测的任务)提供了高可行性的技术路径。但需超越本次有限概念验证的广泛测试以确定其普适性。


Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency

Abstract

arXiv:2505.13499v1 Announce Type: cross Abstract: We study Transformers through the perspective of optimal control theory, using tools from continuous-time formulations to derive actionable insights into training and architecture design. This framework improves the performance of existing Transformer models while providing desirable theoretical guarantees, including generalization and robustness. Our framework is designed to be plug-and-play, enabling seamless integration with established Transformer models and requiring only slight changes to the implementation. We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient. On character-level text generation with nanoGPT, our framework achieves a 46% reduction in final test loss while using 42% fewer parameters. On GPT-2, our framework achieves a 5.6% reduction in final test loss, demonstrating scalability to larger models. To the best of our knowledge, this is the first work that applies optimal control theory to both the training and architecture of Transformers. It offers a new foundation for systematic, theory-driven improvements and moves beyond costly trial-and-error approaches.

摘要

我们通过最优控制理论的视角研究Transformer模型,利用连续时间公式化的工具来推导关于训练和架构设计的可行见解。该框架在提升现有Transformer模型性能的同时,提供了理想的理论保证,包括泛化性和鲁棒性。我们的框架采用即插即用设计,能够与成熟的Transformer模型无缝集成,仅需对实现进行微小改动。我们在文本生成、情感分析、图像分类和点云分类等任务上进行了七项大规模实验。结果表明,该框架在提升基线模型测试性能的同时具有更高的参数效率。在使用nanoGPT进行字符级文本生成时,我们的框架实现了46%的最终测试损失降低,同时减少42%的参数用量;在GPT-2上则实现了5.6%的最终测试损失降低,证明其可扩展至更大模型。据我们所知,这是首个将最优控制理论同时应用于Transformer训练和架构的工作,为系统化、理论驱动的改进提供了新基础,超越了高成本的试错方法。


An agentic system with reinforcement-learned subsystem improvements for parsing form-like documents

Abstract

arXiv:2505.13504v1 Announce Type: cross Abstract: Extracting alphanumeric data from form-like documents such as invoices, purchase orders, bills, and financial documents is often performed via vision (OCR) and learning algorithms or monolithic pipelines with limited potential for systemic improvements. We propose an agentic AI system that leverages Large Language Model (LLM) agents and a reinforcement learning (RL) driver agent to automate consistent, self-improving extraction under LLM inference uncertainty. Our work highlights the limitations of monolithic LLM-based extraction and introduces a modular, multi-agent framework with task-specific prompts and an RL policy of rewards and penalties to guide a meta-prompting agent to learn from past errors and improve prompt-based actor agents. This self-corrective adaptive system handles diverse documents, file formats, layouts, and LLMs, aiming to automate accurate information extraction without the need for human intervention. Results reported on two benchmark datasets, SOIRE and CORD, are promising for the agentic AI framework.

摘要

从发票、采购订单、账单和财务单据等表单类文档中提取字母数字数据,通常通过视觉(OCR)和机器学习算法或具有有限系统性改进潜力的单体流程实现。我们提出一种基于智能代理的人工智能系统,该系统利用大语言模型(LLM)代理和强化学习(RL)驱动代理,在LLM推理不确定性下实现自动化、一致且自我改进的数据提取。本研究揭示了基于单体LLM提取方法的局限性,并引入一个模块化多代理框架,该框架包含任务特定提示和奖惩机制的RL策略,以引导元提示代理从过往错误中学习并改进基于提示的执行代理。这种自我修正的自适应系统能够处理多样化的文档、文件格式、版面和LLM模型,旨在实现无需人工干预的精准信息自动化提取。在SOIRE和CORD两个基准数据集上报告的结果表明,该智能代理框架前景可观。


EcoSafeRAG: Efficient Security through Context Analysis in Retrieval-Augmented Generation

Abstract

arXiv:2505.13506v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) compensates for the static knowledge limitations of Large Language Models (LLMs) by integrating external knowledge, producing responses with enhanced factual correctness and query-specific contextualization. However, it also introduces new attack surfaces such as corpus poisoning at the same time. Most of the existing defense methods rely on the internal knowledge of the model, which conflicts with the design concept of RAG. To bridge the gap, EcoSafeRAG uses sentence-level processing and bait-guided context diversity detection to identify malicious content by analyzing the context diversity of candidate documents without relying on LLM internal knowledge. Experiments show EcoSafeRAG delivers state-of-the-art security with plug-and-play deployment, simultaneously improving clean-scenario RAG performance while maintaining practical operational costs (relatively 1.2× latency, 48%-80% token reduction versus Vanilla RAG).

摘要

检索增强生成(RAG)通过整合外部知识弥补了大语言模型(LLMs)静态知识的局限性,生成具有更高事实准确性和查询特定情境化的响应。然而,该方法也同时引入了新的攻击面,如语料库投毒。现有防御方法大多依赖模型内部知识,这与RAG的设计理念存在冲突。为填补这一空白,EcoSafeRAG采用句子级处理和诱饵引导的上下文多样性检测技术,通过分析候选文档的上下文多样性来识别恶意内容,且无需依赖LLM内部知识。实验表明,EcoSafeRAG在即插即用部署下实现了最先进的安全性,同时提升了干净场景下的RAG性能,并保持可接受的实际运行成本(相对Vanilla RAG,延迟约为其1.2倍,token消耗减少48%-80%)。


IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation

Abstract

arXiv:2505.13498v1 Announce Type: cross Abstract: Recent advances in Large Language Models (LLMs) have demonstrated promising knowledge and reasoning abilities, yet their performance in multilingual and low-resource settings remains underexplored. Existing benchmarks often exhibit cultural bias, restrict evaluation to text-only, rely on multiple-choice formats, and, more importantly, are limited for extremely low-resource languages. To address these gaps, we introduce IRLBench, presented in parallel English and Irish; the latter is considered definitely endangered by UNESCO. Our benchmark consists of 12 representative subjects developed from the 2024 Irish Leaving Certificate exams, enabling fine-grained analysis of model capabilities across domains. By framing the task as long-form generation and leveraging the official marking scheme, it supports a comprehensive evaluation of not only correctness but also language fidelity. Our extensive experiments on leading closed-source and open-source LLMs reveal a persistent performance gap between English and Irish: models produce valid Irish responses less than 80% of the time, and the best-performing model answers correctly 55.8% of the time in Irish compared to 76.2% in English. We release IRLBench (https://huggingface.co/datasets/ReliableAI/IRLBench) and an accompanying evaluation codebase (https://github.com/ReML-AI/IRLBench) to enable future research on robust, culturally aware multilingual AI development.

摘要

尽管大语言模型(LLMs)的最新进展已展现出可观的知识储备与推理能力,但其在多语言及低资源环境下的性能仍未得到充分探索。现有基准测试普遍存在文化偏见、仅限于纯文本评估、依赖多项选择形式等局限,更重要的是对极度低资源语言的覆盖不足。为填补这些空白,我们推出了并行呈现英语与爱尔兰语的IRLBench(联合国教科文组织认定爱尔兰语为明确濒危语言)。该基准包含12个源自2024年爱尔兰毕业证书考试的典型科目,支持跨领域模型能力的细粒度分析。通过将任务构建为长文本生成并采用官方评分标准,本基准不仅能全面评估答案正确性,还可检验语言忠实度。我们对主流闭源与开源LLMs的广泛实验表明:英语与爱尔兰语之间存在持续性能差距,表现最佳模型的爱尔兰语有效回答率不足80%,正确率为55.8%,而其英语正确率达76.2%。我们公开了IRLBench数据集(https://huggingface.co/datasets/ReliableAI/IRLBench)及配套评估代码库(https://github.com/ReML-AI/IRLBench),以促进面向鲁棒性、文化敏感型多语言AI开发的后续研究。


Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models

Abstract

arXiv:2505.13514v1 Announce Type: cross Abstract: Repetition curse is a phenomenon where Large Language Models (LLMs) generate repetitive sequences of tokens or cyclic sequences. While the repetition curse has been widely observed, its underlying mechanisms remain poorly understood. In this work, we investigate the role of induction heads--a specific type of attention head known for their ability to perform in-context learning--in driving this repetitive behavior. Specifically, we focus on the "toxicity" of induction heads, which we define as their tendency to dominate the model's output logits during repetition, effectively excluding other attention heads from contributing to the generation process. Our findings have important implications for the design and training of LLMs. By identifying induction heads as a key driver of the repetition curse, we provide a mechanistic explanation for this phenomenon and suggest potential avenues for mitigation. We also propose a technique with attention head regularization that could be employed to reduce the dominance of induction heads during generation, thereby promoting more diverse and coherent outputs.

摘要

重复诅咒是指大型语言模型(LLMs)生成重复或循环的令牌序列的现象。尽管这一现象已被广泛观察到,但其内在机制仍不甚明了。本研究探讨了归纳头(一种以执行上下文学习能力著称的特定注意力头)在驱动这种重复行为中的作用。具体而言,我们聚焦于归纳头的"毒性",即其在重复过程中主导模型输出逻辑值的倾向,这种倾向会实质上排除其他注意力头对生成过程的贡献。我们的发现对LLMs的设计和训练具有重要意义。通过将归纳头识别为重复诅咒的关键驱动因素,我们为这一现象提供了机制性解释,并提出了潜在的缓解途径。此外,我们还提出了一种采用注意力头正则化的技术,该技术可用于减少生成过程中归纳头的主导地位,从而促进更多样化且连贯的输出。


Beyond Retrieval: Joint Supervision and Multimodal Document Ranking for Textbook Question Answering

Abstract

arXiv:2505.13520v1 Announce Type: cross Abstract: Textbook question answering (TQA) is a complex task, requiring the interpretation of complex multimodal context. Although recent advances have improved overall performance, they often encounter difficulties in educational settings where accurate semantic alignment and task-specific document retrieval are essential. In this paper, we propose a novel approach to multimodal textbook question answering by introducing a mechanism for enhancing semantic representations through multi-objective joint training. Our model, Joint Embedding Training With Ranking Supervision for Textbook Question Answering (JETRTQA), is a multimodal learning framework built on a retriever--generator architecture that uses a retrieval-augmented generation setup, in which a multimodal large language model generates answers. JETRTQA is designed to improve the relevance of retrieved documents in complex educational contexts. Unlike traditional direct scoring approaches, JETRTQA learns to refine the semantic representations of questions and documents through a supervised signal that combines pairwise ranking and implicit supervision derived from answers. We evaluate our method on the CK12-QA dataset and demonstrate that it significantly improves the discrimination between informative and irrelevant documents, even when they are long, complex, and multimodal. JETRTQA outperforms the previous state of the art, achieving a 2.4% gain in accuracy on the validation set and 11.1% on the test set.

摘要

教科书问答(TQA)是一项复杂任务,需要理解多模态上下文。尽管近期研究提升了整体性能,但在需要精确语义对齐和任务特定文档检索的教育场景中仍存在困难。本文提出一种通过多目标联合训练增强语义表征的新型多模态教科书问答方法。我们构建的JETRTQA模型是基于检索器-生成器架构的多模态学习框架,采用检索增强生成机制,通过多模态大语言模型生成答案。该模型旨在提升复杂教育场景中检索文档的相关性,区别于传统直接评分方法,它通过结合成对排序和答案隐含监督的信号,优化问题与文档的语义表征。在CK12-QA数据集上的实验表明,即使面对长文本、复杂多模态文档,该方法也能显著提升信息性文档与无关文档的区分能力。JETRTQA以验证集准确率提升2.4%、测试集提升11.1%的表现超越现有最佳水平。


LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades

Abstract

arXiv:2505.13515v1 Announce Type: cross Abstract: As Large Language Models (LLMs) are frequently updated, LoRA weights trained on earlier versions quickly become obsolete. The conventional practice of retraining LoRA weights from scratch on the latest model is costly, time-consuming, and environmentally detrimental, particularly as the diversity of LLMs and downstream tasks expands. This motivates a critical question: "How can we efficiently leverage existing LoRA weights to adapt to newer model versions?" To address this, we propose LoRASuite, a modular approach tailored specifically to various types of LLM updates. First, we compute a transfer matrix utilizing known parameters from both old and new LLMs. Next, we allocate corresponding layers and attention heads based on centered kernel alignment and cosine similarity metrics, respectively. A subsequent small-scale, skillful fine-tuning step ensures numerical stability. Experimental evaluations demonstrate that LoRASuite consistently surpasses small-scale vanilla LoRA methods. Notably, on backbone LLMs such as MiniCPM and Qwen, LoRASuite even exceeds the performance of full-scale LoRA retraining, with average improvements of +1.4 and +6.6 points on math tasks, respectively. Additionally, LoRASuite significantly reduces memory consumption by 5.5 GB and computational time by 78.23%.

摘要

随着大语言模型(LLMs)的频繁更新,基于早期版本训练的LoRA权重会迅速过时。传统做法是在最新模型上从头开始重新训练LoRA权重,这种方式成本高昂、耗时且对环境有害,尤其是在LLMs多样性和下游任务不断扩展的背景下。这引发了一个关键问题:"如何高效利用现有LoRA权重来适配新版本模型?"为此,我们提出LoRASuite——一种针对不同类型LLM更新的模块化解决方案。首先,我们利用新旧LLMs的已知参数计算转移矩阵;其次,分别基于中心核对齐和余弦相似度指标分配对应层和注意力头;最后通过小规模精细调优确保数值稳定性。实验评估表明,LoRASuite始终优于小规模普通LoRA方法。值得注意的是,在MiniCPM和Qwen等骨干LLMs上,LoRASuite甚至超越全量LoRA重训练的表现,数学任务平均分别提升1.4和6.6个百分点。此外,LoRASuite显著降低内存消耗5.5GB,计算时间减少78.23%。
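The two similarity measures named above, centered kernel alignment (CKA) for matching layers and cosine similarity for matching attention heads, can be sketched as follows. This uses the linear form of CKA on row-aligned activation matrices; whether LoRASuite uses the linear or a kernel variant, and what exactly it feeds into either metric, is not stated in the abstract, so the inputs here are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def _center_columns(x: Matrix) -> List[List[float]]:
    """Subtract each column's mean so every feature has zero mean."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[val - m for val, m in zip(row, means)] for row in x]


def _cross_gram_sq_norm(x: Matrix, y: Matrix) -> float:
    """Squared Frobenius norm of X^T Y for matrices with matching rows."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            total += sum(a * b for a, b in zip(col_x, col_y)) ** 2
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = _center_columns(x), _center_columns(y)
    denom = math.sqrt(_cross_gram_sq_norm(xc, xc) * _cross_gram_sq_norm(yc, yc))
    return _cross_gram_sq_norm(xc, yc) / denom


# Hypothetical activations: 4 examples (rows) by 2 features (columns).
old_layer = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
new_layer = [[2.0 * v for v in row] for row in old_layer]  # rescaled copy
```

Linear CKA is unchanged by uniformly rescaling the activations, so the rescaled copy scores about 1.0 against the original; that invariance is one reason CKA is a common choice for comparing layers across different models.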


Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

Abstract

arXiv:2505.13527v1 Announce Type: cross Abstract: Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.

摘要

尽管在使大语言模型(LLMs)与人类价值观对齐方面取得了显著进展,但现有的安全机制仍易受到越狱攻击。我们假设这种脆弱性源于对齐导向提示与恶意提示之间的分布差异。为探究这一问题,我们提出了LogiBreak——一种新颖且通用的黑盒越狱方法,该方法通过逻辑表达式转换来规避LLM安全系统。通过将有害的自然语言提示转化为形式化逻辑表达式,LogiBreak利用对齐数据与逻辑输入之间的分布间隙,在保留底层语义意图和可读性的同时规避安全约束。我们在涵盖三种语言的多语言越狱数据集上评估LogiBreak,结果表明其在多种评估设置和语言环境中均具有有效性。


Geography-Aware Large Language Models for Next POI Recommendation

Abstract

arXiv:2505.13526v1 Announce Type: cross Abstract: The next Point-of-Interest (POI) recommendation task aims to predict users' next destinations based on their historical movement data and plays a key role in location-based services and personalized applications. Accurate next POI recommendation depends on effectively modeling geographic information and POI transition relations, which are crucial for capturing spatial dependencies and user movement patterns. While Large Language Models (LLMs) exhibit strong capabilities in semantic understanding and contextual reasoning, applying them to spatial tasks like next POI recommendation remains challenging. First, the infrequent nature of specific GPS coordinates makes it difficult for LLMs to model precise spatial contexts. Second, the lack of knowledge about POI transitions limits their ability to capture potential POI-POI relationships. To address these issues, we propose GA-LLM (Geography-Aware Large Language Model), a novel framework that enhances LLMs with two specialized components. The Geographic Coordinate Injection Module (GCIM) transforms GPS coordinates into spatial representations using hierarchical and Fourier-based positional encoding, enabling the model to understand geographic features from multiple perspectives. The POI Alignment Module (PAM) incorporates POI transition relations into the LLM's semantic space, allowing it to infer global POI relationships and generalize to unseen POIs. Experiments on three real-world datasets demonstrate the state-of-the-art performance of GA-LLM.

摘要

下一个兴趣点(POI)推荐任务旨在基于用户历史移动数据预测其下一个目的地,在基于位置的服务和个性化应用中具有关键作用。准确的下一个POI推荐依赖于对地理信息和POI转移关系的有效建模,这对捕捉空间依赖性和用户移动模式至关重要。尽管大语言模型(LLM)在语义理解和上下文推理方面表现出强大能力,但将其应用于下一个POI推荐等空间任务仍存在挑战。首先,特定GPS坐标的低频特性使得LLM难以建模精确的空间上下文;其次,缺乏对POI转移关系的认知限制了其捕捉潜在POI间关系的能力。为解决这些问题,我们提出GA-LLM(地理感知大语言模型)框架,通过两个专用组件增强LLM:地理坐标注入模块(GCIM)采用分层和基于傅里叶的位置编码将GPS坐标转化为空间表征,使模型能从多视角理解地理特征;POI对齐模块(PAM)将POI转移关系融入LLM语义空间,使其能推断全局POI关系并泛化至未见POI。在三个真实数据集上的实验表明GA-LLM达到了最先进的性能。
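A Fourier-based positional encoding of a GPS coordinate can be sketched as below: each coordinate is scaled to [-1, 1] and expanded into sine and cosine features at geometrically increasing frequencies. This illustrates the general technique only; the number of frequencies, the scaling, and the function name are assumptions, and the hierarchical part of GCIM is not covered.

```python
import math
from typing import List


def fourier_encode(lat: float, lon: float, num_freqs: int = 4) -> List[float]:
    """Encode (lat, lon) as sin/cos features at frequencies 1, 2, 4, ... times pi."""
    # Scale both coordinates to [-1, 1] so they share the same frequency bands.
    scaled = (lat / 90.0, lon / 180.0)
    features: List[float] = []
    for value in scaled:
        for k in range(num_freqs):
            angle = (2 ** k) * math.pi * value
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


# Example coordinate (roughly New York City); output length is 2 * 2 * num_freqs.
encoded = fourier_encode(40.7, -74.0)
```

Low frequencies separate distant regions while high frequencies separate nearby points, so a rarely seen raw coordinate becomes a dense vector that shares structure with its neighbours.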


Time-R1: Towards Comprehensive Temporal Reasoning in LLMs

Abstract

arXiv:2505.13508v1 Announce Type: cross Abstract: Large Language Models (LLMs) demonstrate impressive capabilities but lack robust temporal intelligence, struggling to integrate reasoning about the past with predictions and plausible generations of the future. Meanwhile, existing methods typically target isolated temporal skills, such as question answering about past events or basic forecasting, and exhibit poor generalization, particularly when dealing with events beyond their knowledge cutoff or requiring creative foresight. To address these limitations, we introduce Time-R1, the first framework to endow a moderate-sized (3B-parameter) LLM with comprehensive temporal abilities: understanding, prediction, and creative generation. Our approach features a novel three-stage development path; the first two constitute a reinforcement learning (RL) curriculum driven by a meticulously designed dynamic rule-based reward system. This framework progressively builds (1) foundational temporal understanding and logical event-time mappings from historical data, (2) future event prediction skills for events beyond its knowledge cutoff, and finally (3) enables remarkable generalization to creative future scenario generation without any fine-tuning. Strikingly, experiments demonstrate that Time-R1 outperforms models over 200 times larger, including the state-of-the-art 671B DeepSeek-R1, on highly challenging future event prediction and creative scenario generation benchmarks. This work provides strong evidence that thoughtfully engineered, progressive RL fine-tuning allows smaller, efficient models to achieve superior temporal performance, offering a practical and scalable path towards truly time-aware AI. To foster further research, we also release Time-Bench, a large-scale multi-task temporal reasoning dataset derived from 10 years of news data, and our series of Time-R1 checkpoints.

摘要

大语言模型(LLMs)展现出令人印象深刻的能力,但缺乏稳健的时间智能,难以将关于过去的推理与未来预测及合理生成相结合。现有方法通常针对孤立的时间技能,如基于过去事件的问答或基础预测,且泛化能力较差,尤其在处理超出其知识截止时间的事件或需要创造性预见时表现不佳。为突破这些局限,我们提出Time-R1——首个赋予中等规模(30亿参数)大语言模型全面时间能力的框架:理解、预测与创造性生成。该方法采用创新的三阶段开发路径,前两阶段构成由精心设计的动态规则奖励系统驱动的强化学习(RL)课程。该框架逐步构建:(1)从历史数据中建立基础时间理解与逻辑事件-时间映射;(2)针对知识截止时间后事件的未来预测能力;最终(3)无需微调即可实现向创造性未来场景生成的显著泛化。引人注目的是,实验表明Time-R1在极具挑战性的未来事件预测和创意场景生成基准测试中,性能超越体积超过200倍的模型(包括671B参数的顶尖模型DeepSeek-R1)。这项工作有力证明:经过精心设计的渐进式RL微调可使小型高效模型获得更优的时间性能,为实现真正具有时间感知能力的人工智能提供了实用且可扩展的路径。为促进后续研究,我们还发布了基于10年新闻数据构建的大规模多任务时间推理数据集Time-Bench,以及Time-R1系列检查点。


LLM-Based User Simulation for Low-Knowledge Shilling Attacks on Recommender Systems

Abstract

arXiv:2505.13528v1 Announce Type: cross Abstract: Recommender systems (RS) are increasingly vulnerable to shilling attacks, where adversaries inject fake user profiles to manipulate system outputs. Traditional attack strategies often rely on simplistic heuristics, require access to internal RS data, and overlook the manipulation potential of textual reviews. In this work, we introduce Agent4SR, a novel framework that leverages Large Language Model (LLM)-based agents to perform low-knowledge, high-impact shilling attacks through both rating and review generation. Agent4SR simulates realistic user behavior by orchestrating adversarial interactions, selecting items, assigning ratings, and crafting reviews, while maintaining behavioral plausibility. Our design includes targeted profile construction, hybrid memory retrieval, and a review attack strategy that propagates target item features across unrelated reviews to amplify manipulation. Extensive experiments on multiple datasets and RS architectures demonstrate that Agent4SR outperforms existing low-knowledge baselines in both effectiveness and stealth. Our findings reveal a new class of emergent threats posed by LLM-driven agents, underscoring the urgent need for enhanced defenses in modern recommender systems.

摘要

推荐系统(RS)正日益面临托攻击的威胁,攻击者通过注入虚假用户画像来操纵系统输出。传统攻击策略通常依赖简单启发式方法,需要获取系统内部数据,且忽视了文本评论的操纵潜力。本研究提出Agent4SR这一创新框架,利用基于大语言模型(LLM)的智能体,通过评分与评论生成实现低知识门槛、高影响力的托攻击。该框架通过协调对抗性交互、选择商品、分配评分及生成评论来模拟真实用户行为,同时保持行为合理性。我们的设计包含定向画像构建、混合记忆检索以及评论攻击策略——通过将目标商品特征扩散至无关评论中以放大操纵效果。在多数据集及多种推荐架构上的实验表明,Agent4SR在攻击效果与隐蔽性方面均优于现有低知识基线方法。本研究揭示了LLM驱动智能体带来的新型涌现威胁,凸显了现代推荐系统亟需加强防御的紧迫性。


RAGXplain: From Explainable Evaluation to Actionable Guidance of RAG Pipelines

Abstract

arXiv:2505.13538v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems show promise by coupling large language models with external knowledge, yet traditional RAG evaluation methods primarily report quantitative scores while offering limited actionable guidance for refining these complex pipelines. In this paper, we introduce RAGXplain, an evaluation framework that quantifies RAG performance and translates these assessments into clear insights that clarify the workings of its complex, multi-stage pipeline and offer actionable recommendations. Using LLM reasoning, RAGXplain converts raw scores into coherent narratives identifying performance gaps and suggesting targeted improvements. By providing transparent explanations for AI decision-making, our framework fosters user trust, a key challenge in AI adoption. Our LLM-based metric assessments show strong alignment with human judgments, and experiments on public question-answering datasets confirm that applying RAGXplain's actionable recommendations measurably improves system performance. RAGXplain thus bridges quantitative evaluation and practical optimization, empowering users to understand, trust, and enhance their AI systems.

摘要

检索增强生成(RAG)系统通过将大型语言模型与外部知识相结合展现出潜力,但传统的RAG评估方法主要报告定量分数,对于优化这些复杂流程提供的可操作指导有限。本文提出RAGXplain,一个既能量化RAG性能,又能将这些评估转化为清晰见解的评估框架,以阐明其复杂多阶段流程的工作原理并提供可操作建议。通过利用LLM推理,RAGXplain将原始分数转化为连贯的叙述,识别性能差距并建议针对性改进。通过为AI决策提供透明解释,我们的框架增强了用户信任——这是AI应用中的关键挑战。基于LLM的指标评估显示与人类判断高度一致,在公开问答数据集上的实验证实,应用RAGXplain的可操作建议能显著提升系统性能。RAGXplain由此架起了定量评估与实际优化之间的桥梁,使用户能够理解、信任并增强其AI系统。


Multi-head Temporal Latent Attention

Abstract

arXiv:2505.13544v1 Announce Type: cross Abstract: While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on an English-German speech translation task, MTLA achieves a 5.3x speedup and a reduction in GPU memory usage by a factor of 8.3 compared to MHA, while maintaining translation quality.

摘要

虽然Transformer自注意力机制具有强大的并行性,但其键值(KV)缓存会随序列长度线性增长,成为推理效率的瓶颈。近期提出的多头潜在注意力通过将KV缓存压缩至低秩潜在空间来解决该问题。本文提出多头时序潜在注意力(MTLA),进一步沿时间维度缩减KV缓存大小,显著降低自注意力推理的内存占用。MTLA采用超网络动态合并时序相邻的KV缓存向量。针对压缩KV缓存与处理序列长度不匹配的问题,提出步长感知因果掩码,确保高效并行训练并与推理行为保持一致。在语音翻译、语音识别、语音理解和文本摘要等任务上的实验表明,MTLA在保持与标准多头注意力(MHA)相当性能的同时,大幅提升推理速度并降低GPU内存占用。例如,在英德语音翻译任务中,MTLA在保持翻译质量的前提下,相比MHA实现5.3倍加速,GPU内存占用减少8.3倍。
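The temporal merging idea can be illustrated with a toy cache. The following is a minimal sketch, not MTLA itself: it assumes plain averaging in place of the hyper-network's learned merge weights, and it leaves out the low-rank latent projection and the stride-aware causal mask. The function name `merge_adjacent` and the `stride` parameter are hypothetical.

```python
import math


def merge_adjacent(cache, stride):
    """Average each run of `stride` temporally adjacent cache vectors into one.

    Assumption: mean pooling stands in for the learned, input-dependent
    merge weights that the abstract attributes to a hyper-network.
    """
    merged = []
    for start in range(0, len(cache), stride):
        chunk = cache[start:start + stride]
        dim = len(chunk[0])
        merged.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return merged


# Five cached vectors of dimension 2 (illustrative values only).
kv_cache = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
compressed = merge_adjacent(kv_cache, stride=2)

# The cache length shrinks from T to ceil(T / stride).
expected_len = math.ceil(len(kv_cache) / 2)
```

With a stride of 2 the five cached vectors collapse to three, which is the sense in which the cache shrinks along the temporal dimension.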


AdAEM: An Adaptively and Automated Extensible Measurement of LLMs' Value Difference

Abstract

arXiv:2505.13531v1 Announce Type: cross Abstract: Assessing Large Language Models (LLMs)' underlying value differences enables comprehensive comparison of their misalignment, cultural adaptability, and biases. Nevertheless, current value measurement datasets face the informativeness challenge: with often outdated, contaminated, or generic test questions, they can only capture the shared value orientations among different LLMs, leading to saturated and thus uninformative results. To address this problem, we introduce AdAEM, a novel, self-extensible assessment framework for revealing LLMs' inclinations. Distinct from previous static benchmarks, AdAEM can automatically and adaptively generate and extend its test questions. This is achieved by probing the internal value boundaries of a diverse set of LLMs developed across cultures and time periods in an in-context optimization manner. The optimization process theoretically maximizes an information-theoretic objective to extract the latest or culturally controversial topics, providing more distinguishable and informative insights about models' value differences. In this way, AdAEM is able to co-evolve with the development of LLMs, consistently tracking their value dynamics. Using AdAEM, we generate 12,310 questions grounded in Schwartz Value Theory, conduct an extensive analysis to manifest our method's validity and effectiveness, and benchmark the values of 16 LLMs, laying the groundwork for better value research.

摘要

评估大型语言模型(LLMs)的潜在价值差异,有助于全面比较其错位性、文化适应性和偏见。然而,当前价值测量数据集面临信息有效性挑战:测试问题往往过时、被污染或过于泛化,仅能捕捉不同LLMs之间的共同价值取向,导致结果趋于饱和且缺乏信息量。为解决这一问题,我们提出AdAEM——一种新型自扩展评估框架,用于揭示LLMs的价值倾向。与以往静态基准不同,AdAEM能通过上下文优化方式,自动适应性地生成并扩展测试问题。该方法通过探测跨文化、跨时期开发的多样化LLMs内部价值边界,在理论上最大化信息论目标以提取最新或具有文化争议性的话题,从而提供更具区分度和信息量的模型价值差异洞察。由此,AdAEM得以与LLMs发展同步进化,持续追踪其价值动态。基于AdAEM框架,我们生成12,310个以施瓦茨价值理论为基础的问题,通过广泛分析验证方法的有效性与可行性,并对16个LLMs进行价值基准测试,为更深入的价值研究奠定基础。


Information Extraction from Visually Rich Documents using LLM-based Organization of Documents into Independent Textual Segments

Abstract

arXiv:2505.13535v1 Announce Type: cross Abstract: Information extraction (IE) from Visually Rich Documents (VRDs) containing layout features along with text is a critical and well-studied task. Specialized non-LLM NLP-based solutions typically involve training models using both textual and geometric information to label sequences/tokens as named entities or answers to specific questions. However, these approaches lack reasoning, are not able to infer values not explicitly present in documents, and do not generalize well to new formats. Generative LLM-based approaches proposed recently are capable of reasoning, but struggle to comprehend clues from document layout especially in previously unseen document formats, and do not show competitive performance in heterogeneous VRD benchmark datasets. In this paper, we propose BLOCKIE, a novel LLM-based approach that organizes VRDs into localized, reusable semantic textual segments called semantic blocks, which are processed independently. Through focused and more generalizable reasoning, our approach outperforms the state-of-the-art on public VRD benchmarks by 1-3% in F1 scores, is resilient to document formats previously not encountered and shows abilities to correctly extract information not explicitly present in documents.

摘要

从包含文本和布局特征的视觉丰富文档(VRD)中进行信息抽取(IE)是一项关键且被广泛研究的任务。传统的非大语言模型(LLM)自然语言处理解决方案通常需要同时利用文本和几何信息训练模型,以将序列/标记标注为命名实体或特定问题的答案。然而,这些方法缺乏推理能力,无法推断文档中未明确出现的值,并且对新格式的泛化能力较差。近期提出的基于生成式LLM的方法虽具备推理能力,但难以理解文档布局中的线索(尤其是在未见过的文档格式中),且在异构VRD基准数据集上未能展现出竞争优势。本文提出BLOCKIE,一种基于LLM的创新方法,该方法将VRD组织为可局部化、可复用的语义文本片段(称为语义块),并对其进行独立处理。通过聚焦且更具泛化性的推理,我们的方法在公开VRD基准测试中以F1分数领先现有最佳技术1-3%,对未接触过的文档格式具有强韧性,并能正确提取文档中未明确呈现的信息。


InterFeat: An Automated Pipeline for Finding Interesting Hypotheses in Structured Biomedical Data

Abstract

arXiv:2505.13534v1 Announce Type: cross Abstract: Finding interesting phenomena is the core of scientific discovery, but it is a manual, ill-defined concept. We present an integrative pipeline for automating the discovery of interesting simple hypotheses (feature-target relations with effect direction and a potential underlying mechanism) in structured biomedical data. The pipeline combines machine learning, knowledge graphs, literature search and Large Language Models. We formalize "interestingness" as a combination of novelty, utility and plausibility. On 8 major diseases from the UK Biobank, our pipeline consistently recovers risk factors years before their appearance in the literature. 40-53% of our top candidates were validated as interesting, compared to 0-7% for a SHAP-based baseline. Overall, 28% of 109 candidates were interesting to medical experts. The pipeline addresses the challenge of operationalizing "interestingness" scalably and for any target. We release data and code: https://github.com/LinialLab/InterFeat

摘要

发现有趣现象是科学发现的核心,但这通常是一个人工操作且定义模糊的概念。我们提出了一种集成化流程,用于自动化发现结构化生物医学数据中有趣的简单假设(具有效应方向和潜在机制的特征-目标关系)。该流程结合了机器学习、知识图谱、文献检索和大语言模型技术。我们将"有趣性"形式化为新颖性、实用性和合理性的组合。在英国生物银行的8种主要疾病数据上,我们的流程能持续发现比文献记载早数年的风险因素。在排名靠前的候选假设中,40%-53%被验证为具有研究价值,而基于SHAP的基线方法仅为0%-7%。总体而言,医学专家认为109个候选假设中有28%具有研究意义。该流程解决了"有趣性"可扩展操作化及适用于任意目标的挑战。我们公开了数据和代码:https://github.com/LinialLab/InterFeat


Exploring Federated Pruning for Large Language Models

Abstract

arXiv:2505.13547v1 Announce Type: cross Abstract: LLM pruning has emerged as a promising technology for compressing LLMs, enabling their deployment on resource-limited devices. However, current methodologies typically require access to public calibration samples, which can be challenging to obtain in privacy-sensitive domains. To address this issue, we introduce FedPrLLM, a comprehensive federated pruning framework designed for the privacy-preserving compression of LLMs. In FedPrLLM, each client only needs to calculate a pruning mask matrix based on its local calibration data and share it with the server to prune the global model. This approach allows for collaborative pruning of the global model with the knowledge of each client while maintaining local data privacy. Additionally, we conduct extensive experiments to explore various possibilities within the FedPrLLM framework, including different comparison groups, pruning strategies, and the decision to scale weights. Our extensive evaluation reveals that one-shot pruning with layer comparison and no weight scaling is the optimal choice within the FedPrLLM framework. We hope our work will help guide future efforts in pruning LLMs in privacy-sensitive fields. Our code is available at https://github.com/Pengxin-Guo/FedPrLLM.

摘要

大语言模型(LLM)剪枝技术作为一种有前景的模型压缩方法,能够推动LLM在资源受限设备上的部署。然而,现有方法通常需要获取公共校准样本,这在隐私敏感领域往往难以实现。为解决该问题,我们提出了FedPrLLM——一个专为隐私保护型LLM压缩设计的综合联邦剪枝框架。在FedPrLLM中,各客户端仅需基于本地校准数据计算剪枝掩码矩阵,并与服务器共享以完成全局模型剪枝。该方法在维护本地数据隐私的同时,实现了基于各客户端知识的全局模型协同剪枝。此外,我们通过大量实验探究了FedPrLLM框架下的多种可能性,包括不同对比组、剪枝策略以及权重缩放决策。综合评估表明,采用层级比较且不缩放权重的一次性剪枝是FedPrLLM框架下的最优选择。本研究有望为隐私敏感领域的LLM剪枝工作提供指导。代码已开源:https://github.com/Pengxin-Guo/FedPrLLM。
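The client/server exchange described above can be sketched on a toy layer. This is only an illustration under stated assumptions: the abstract says clients share a pruning mask and the server prunes the global model, but it does not give the aggregation rule, so the vote counting and top-k selection below are assumptions. Reading "layer comparison" as ranking votes across the whole layer is likewise an assumption, and all function names are hypothetical.

```python
def aggregate_masks(client_masks):
    """Sum binary masks element-wise; each entry counts clients voting to prune."""
    rows, cols = len(client_masks[0]), len(client_masks[0][0])
    return [[sum(mask[i][j] for mask in client_masks) for j in range(cols)]
            for i in range(rows)]


def prune_layer(weights, votes, sparsity):
    """Zero the most-voted weights, ranking votes across the whole layer.

    Assumption: top-k by vote count; ties keep row-major order because
    Python's sort is stable.
    """
    ranked = [(votes[i][j], i, j)
              for i in range(len(weights)) for j in range(len(weights[0]))]
    ranked.sort(key=lambda item: -item[0])
    num_to_prune = int(sparsity * len(ranked))
    pruned = [row[:] for row in weights]
    for _, i, j in ranked[:num_to_prune]:
        pruned[i][j] = 0.0
    return pruned


# Global layer weights and three client masks (1 = "prune this weight").
global_weights = [[0.9, -0.1], [0.05, 0.7]]
client_masks = [
    [[0, 1], [1, 0]],
    [[0, 1], [1, 0]],
    [[0, 1], [0, 1]],
]
votes = aggregate_masks(client_masks)
pruned_weights = prune_layer(global_weights, votes, sparsity=0.5)
```

Only the masks leave each client in this sketch; the calibration data used to compute them stays local, which is the privacy property the abstract emphasizes.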


Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation

Abstract

arXiv:2505.13554v1 Announce Type: cross Abstract: Large language models (LLMs) show promising performance in a variety of downstream tasks, such as machine translation (MT). However, using LLMs for translation suffers from high computational costs and significant latency. Based on our evaluation, in most cases, translations using LLMs are comparable to those generated by neural machine translation (NMT) systems. Only in particular scenarios do LLM and NMT models show respective advantages. As a result, integrating NMT and LLM for translation and using the LLM only when necessary seems to be a sound solution. A scheduling policy that optimizes translation results while ensuring fast speed and as little LLM usage as possible is thereby required. We compare several scheduling policies and propose a novel and straightforward decider that leverages source sentence features. We conduct extensive experiments on multilingual test sets, and the results show that we can achieve optimal translation performance with minimal LLM usage, demonstrating the effectiveness of our decider.

摘要

大语言模型(LLM)在多种下游任务中展现出卓越性能,例如机器翻译(MT)。然而,使用LLM进行翻译存在计算成本高和延迟显著的问题。根据我们的评估,在大多数情况下,LLM生成的翻译与神经机器翻译(NMT)系统相当。仅在特定场景下,LLM和NMT模型才表现出各自优势。因此,将NMT与LLM结合用于翻译,仅在必要时使用LLM,似乎是一种合理的解决方案。为此,需要一种调度策略,在确保快速翻译速度和尽可能少使用LLM的同时优化翻译结果。我们比较了多种调度策略,并提出了一种新颖且直接的决策器,该决策器利用源句特征。我们在多语言测试集上进行了大量实验,结果表明,我们能够以最少的LLM使用实现最佳翻译性能,证明了我们决策器的有效性。


Know Or Not: a library for evaluating out-of-knowledge base robustness

Abstract

arXiv:2505.13545v1 Announce Type: cross Abstract: While the capabilities of large language models (LLMs) have progressed significantly, their use in high-stakes applications has been limited due to risks of hallucination. One key approach in reducing hallucination is retrieval-augmented generation (RAG), but even in such setups, LLMs may still hallucinate when presented with questions outside of the knowledge base. Such behavior is unacceptable in high-stakes applications where LLMs are expected to abstain from answering queries they do not have sufficient context on. In this work, we present a novel methodology for systematically evaluating out-of-knowledge base (OOKB) robustness of LLMs (whether LLMs know or do not know) in the RAG setting, without the need for manual annotation of gold standard answers. We implement our methodology in knowornot, an open-source library that enables users to develop their own customized evaluation data and pipelines for OOKB robustness. knowornot comprises four main features. Firstly, it provides a unified, high-level API that streamlines the process of setting up and running robustness benchmarks. Secondly, its modular architecture emphasizes extensibility and flexibility, allowing users to easily integrate their own LLM clients and RAG settings. Thirdly, its rigorous data modeling design ensures experiment reproducibility, reliability and traceability. Lastly, it implements a comprehensive suite of tools for users to customize their pipelines. We demonstrate the utility of knowornot by developing a challenging benchmark, PolicyBench, which spans four Question-Answer (QA) chatbots on government policies, and analyze its OOKB robustness. The source code of knowornot is available at https://github.com/govtech-responsibleai/KnowOrNot.

摘要

尽管大语言模型(LLM)的能力已显著提升,但由于幻觉风险的存在,其在高风险应用中的使用仍受限制。减少幻觉的关键方法之一是检索增强生成(RAG),但即使在此类配置下,当面对知识库之外的问题时,LLM仍可能产生幻觉。这种行为在要求LLM对缺乏足够上下文支持的查询必须拒绝应答的高风险场景中是不可接受的。本研究提出了一种创新方法,用于系统评估RAG场景下LLM对知识库外(OOKB)问题的鲁棒性(即判断LLM是否知晓答案),且无需人工标注标准答案。我们将该方法实现在开源库knowornot中,该工具支持用户开发自定义的OOKB鲁棒性评估数据与流程。knowornot具备四大核心特性:首先,提供统一的高级API以简化鲁棒性基准测试的配置与执行流程;其次,采用模块化架构设计强调可扩展性与灵活性,便于用户集成自有LLM客户端与RAG配置;第三,通过严谨的数据建模设计确保实验的可复现性、可靠性与可追溯性;最后,内置全套工具支持用户定制化流程。我们通过构建涵盖四个政府政策问答机器人的高难度基准测试PolicyBench,验证了knowornot的实用性,并分析了其OOKB鲁棒性表现。项目源代码详见https://github.com/govtech-responsibleai/KnowOrNot。


VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation

Abstract

arXiv:2505.13577v1 Announce Type: cross Abstract: Vocal health plays a crucial role in people's lives, significantly impacting their communicative abilities and interactions. However, despite the global prevalence of voice disorders, many lack access to convenient diagnosis and treatment. This paper introduces VocalAgent, an audio large language model (LLM) to address these challenges through vocal health diagnosis. We leverage Qwen-Audio-Chat fine-tuned on three datasets collected in-situ from hospital patients, and present a multifaceted evaluation framework encompassing a safety assessment to mitigate diagnostic biases, cross-lingual performance analysis, and modality ablation studies. VocalAgent demonstrates superior accuracy on voice disorder classification compared to state-of-the-art baselines. Its LLM-based method offers a scalable solution for broader adoption of health diagnostics, while underscoring the importance of ethical and technical validation.

摘要

嗓音健康在人们的生活中起着至关重要的作用,显著影响其沟通能力和社交互动。然而,尽管嗓音障碍在全球范围内普遍存在,许多人仍难以获得便捷的诊断和治疗。本文介绍了VocalAgent,一种音频大语言模型(LLM),旨在通过嗓音健康诊断应对这些挑战。我们基于从医院患者现场收集的三个数据集对Qwen-Audio-Chat进行微调,并提出一个多维评估框架,包括用于减轻诊断偏差的安全性评估、跨语言性能分析和模态消融研究。与现有最先进的基线模型相比,VocalAgent在嗓音障碍分类方面表现出更高的准确性。其基于LLM的方法为健康诊断的更广泛采用提供了可扩展的解决方案,同时强调了伦理和技术验证的重要性。


AMAQA: A Metadata-based QA Dataset for RAG Systems

Abstract

arXiv:2505.13557v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) systems are widely used in question-answering (QA) tasks, but current benchmarks lack metadata integration, hindering evaluation in scenarios requiring both textual data and external information. To address this, we present AMAQA, a new open-access QA dataset designed to evaluate tasks combining text and metadata. The integration of metadata is especially important in fields that require rapid analysis of large volumes of data, such as cybersecurity and intelligence, where timely access to relevant information is critical. AMAQA includes about 1.1 million English messages collected from 26 public Telegram groups, enriched with metadata such as timestamps, topics, emotional tones, and toxicity indicators, which enable precise and contextualized queries by filtering documents based on specific criteria. It also includes 450 high-quality QA pairs, making it a valuable resource for advancing research on metadata-driven QA and RAG systems. To the best of our knowledge, AMAQA is the first single-hop QA benchmark to incorporate metadata and labels such as topics covered in the messages. We conduct extensive tests on the benchmark, establishing a new standard for future research. We show that leveraging metadata boosts accuracy from 0.12 to 0.61, highlighting the value of structured context. Building on this, we explore several strategies to refine the LLM input by iterating over provided context and enriching it with noisy documents, achieving a further 3-point gain over the best baseline and a 14-point improvement over simple metadata filtering. The dataset is available at https://anonymous.4open.science/r/AMAQA-5D0D/

摘要

检索增强生成(RAG)系统在问答(QA)任务中应用广泛,但现有基准测试缺乏元数据整合,阻碍了需要结合文本数据与外部信息的场景评估。为此,我们提出AMAQA——一个新型开放访问QA数据集,专为评估结合文本与元数据的任务而设计。元数据集成在需要快速分析海量数据的领域(如网络安全和情报)尤为重要,这些领域对及时获取相关信息有严格要求。AMAQA包含从26个公开Telegram群组收集的约110万条英文消息,并附有时间戳、主题、情感倾向和毒性指标等元数据,可通过特定条件筛选文档实现精准的上下文查询。该数据集还包含450组高质量QA对,为推进元数据驱动QA和RAG系统研究提供了宝贵资源。据我们所知,AMAQA是首个整合元数据及消息主题标签的单跳QA基准测试。我们对该基准进行了大量测试,为未来研究确立了新标准。实验表明,利用元数据可将准确率从0.12提升至0.61,凸显了结构化上下文的价值。在此基础上,我们探索了多种优化大语言模型输入的策略:通过迭代提供上下文并用噪声文档进行增强,相比最佳基线实现了3个百分点的提升,较简单元数据过滤则有14个百分点的改进。数据集详见https://anonymous.4open.science/r/AMAQA-5D0D/
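Metadata filtering ahead of retrieval, as described in this abstract, can be sketched in a few lines. This is a minimal illustration, not the AMAQA pipeline: the field names (`topic`, `toxic`), the sample messages, and the helper `filter_by_metadata` are all hypothetical.

```python
def filter_by_metadata(messages, **criteria):
    """Keep only messages whose metadata fields equal every given criterion."""
    return [msg for msg in messages
            if all(msg.get(key) == value for key, value in criteria.items())]


# Toy corpus with hypothetical metadata fields.
messages = [
    {"id": 1, "text": "New phishing kit spotted", "topic": "security", "toxic": False},
    {"id": 2, "text": "Match highlights tonight", "topic": "sports", "toxic": False},
    {"id": 3, "text": "Leaked credentials dump", "topic": "security", "toxic": True},
]

# Narrow the candidate pool before any text retrieval or LLM call.
candidates = filter_by_metadata(messages, topic="security", toxic=False)
candidate_ids = [msg["id"] for msg in candidates]
```

Filtering first shrinks the set of documents the retriever has to rank, which is one plausible reading of why structured context helps in the reported results.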


Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression

Abstract

arXiv:2505.13563v1 Announce Type: cross Abstract: With the rise of the fine-tuned–pretrained paradigm, storing numerous fine-tuned models for multi-tasking creates significant storage overhead. Delta compression alleviates this by storing only the pretrained model and the highly compressed delta weights (the differences between fine-tuned and pretrained model weights). However, existing methods fail to maintain both high compression and performance, and often rely on data. To address these challenges, we propose UltraDelta, the first data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions, using three key components: (1) Variance-Based Mixed Sparsity Allocation assigns sparsity based on variance, giving lower sparsity to high-variance layers to preserve inter-layer information. (2) Distribution-Aware Compression applies uniform quantization and then groups parameters by value, followed by group-wise pruning, to better preserve intra-layer distribution. (3) Trace-Norm-Guided Rescaling uses the trace norm of delta weights to estimate a global rescaling factor, improving model stability under higher compression. Extensive experiments across (a) large language models (fine-tuned on LLaMA-2 7B and 13B) with up to 133x, (b) general NLP models (RoBERTa-base, T5-base) with up to 800x, (c) vision models (ViT-B/32, ViT-L/14) with up to 400x, and (d) multi-modal models (BEiT-3) with 40x compression ratio, demonstrate that UltraDelta consistently outperforms existing methods, especially under ultra-high compression.

摘要

随着微调-预训练范式的兴起,为多任务存储大量微调模型带来了显著的存储开销。Delta压缩通过仅存储预训练模型和高度压缩的delta权重(微调与预训练模型权重之差)来缓解这一问题。然而,现有方法难以同时保持高压缩率与性能,且通常依赖数据。为解决这些挑战,我们提出UltraDelta——首个无需数据的delta压缩流程,既能实现超高压缩率又能保持强劲性能。该方案通过三个关键组件在层间、层内和全局维度实现冗余最小化、信息最大化与性能稳定化:(1)基于方差的混合稀疏分配根据方差分配稀疏度,对高方差层赋予较低稀疏度以保留层间信息;(2)分布感知压缩先进行均匀量化,再按参数值分组实施分组剪枝,更好地保持层内分布特性;(3)迹范数引导重缩放利用delta权重的迹范数估计全局缩放因子,提升高压缩率下的模型稳定性。在(a)大语言模型(LLaMA-2 7B/13B微调)最高133倍、(b)通用NLP模型(RoBERTa-base/T5-base)最高800倍、(c)视觉模型(ViT-B/32/ViT-L/14)最高400倍及(d)多模态模型(BEiT-3)40倍压缩比的广泛实验中,UltraDelta始终优于现有方法,尤其在超高压缩率下表现突出。
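The basic idea of delta compression (store the pretrained weights once, plus a heavily compressed difference per fine-tuned model) can be shown on a flat weight vector. This is a generic sketch, not UltraDelta: magnitude top-k pruning is swapped in as a stand-in for the paper's variance-based sparsity allocation, distribution-aware compression, and trace-norm-guided rescaling. All function names are hypothetical.

```python
def delta_weights(finetuned, pretrained):
    """Delta = fine-tuned weights minus pretrained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def sparsify_top_k(delta, keep):
    """Keep the `keep` largest-magnitude entries, stored as {index: value}.

    Assumption: plain magnitude pruning, used here only for illustration.
    """
    top = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)[:keep]
    return {i: delta[i] for i in top}


def reconstruct(pretrained, sparse_delta):
    """Approximate the fine-tuned weights from the shared base plus sparse delta."""
    weights = list(pretrained)
    for index, value in sparse_delta.items():
        weights[index] += value
    return weights


pretrained = [1.0, 2.0, 3.0, 4.0]
finetuned = [1.5, 2.0, 2.0, 4.25]

sparse_delta = sparsify_top_k(delta_weights(finetuned, pretrained), keep=2)
approx_finetuned = reconstruct(pretrained, sparse_delta)
```

Here the smallest nonzero delta (0.25 on the last weight) is dropped, so the reconstruction is close to, but not identical to, the fine-tuned vector; that gap is the compression error that the methods in the abstract aim to keep small.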


OMGPT: A Sequence Modeling Framework for Data-driven Operational Decision Making

Abstract

arXiv:2505.13580v1 Announce Type: cross Abstract: We build a Generative Pre-trained Transformer (GPT) model from scratch, which we call OMGPT, to solve sequential decision making tasks arising in contexts of operations research and management science. We first propose a general sequence modeling framework to cover several operational decision making tasks as special cases, such as dynamic pricing, inventory management, resource allocation, and queueing control. Under the framework, all these tasks can be viewed as a sequential prediction problem where the goal is to predict the optimal future action given all the historical information. Then we train a transformer-based neural network model (OMGPT) as a natural and powerful architecture for sequential modeling. This marks a paradigm shift compared to the existing methods for these OR/OM tasks in that (i) the OMGPT model can take advantage of the huge amount of pre-training data; (ii) when tackling these problems, OMGPT does not assume any analytical model structure and enables a direct and rich mapping from the history to the future actions. To the best of our knowledge, neither of these two aspects is achieved by any existing method. We establish a Bayesian perspective to theoretically understand the working mechanism of the OMGPT on these tasks, which relates its performance with the pre-training task diversity and the divergence between the testing task and pre-training tasks. Numerically, we observe a surprising performance of the proposed model across all the above tasks.

摘要

我们构建了一个从头开始的生成式预训练变换器(GPT)模型,用于解决运筹学和管理科学中出现的序列决策任务,并将其命名为OMGPT。首先,我们提出了一个通用的序列建模框架,涵盖动态定价、库存管理、资源分配和排队控制等多个运营决策任务作为特例。在该框架下,所有这些任务均可视为一个序列预测问题,其目标是根据历史信息预测未来最优行动。随后,我们训练了一个基于Transformer的神经网络模型(OMGPT),作为序列建模的自然且强大的架构。与现有运筹学/运营管理任务方法相比,这标志着一个范式转变:(i)OMGPT模型能够利用海量预训练数据;(ii)在处理这些问题时,OMGPT无需假设任何解析模型结构,实现了从历史到未来行动的直接且丰富的映射。据我们所知,现有方法尚未实现这两个方面中的任何一个。我们建立了贝叶斯视角来从理论上理解OMGPT在这些任务上的工作机制,将其性能与预训练任务多样性及测试任务与预训练任务之间的差异联系起来。数值实验表明,所提模型在上述所有任务中均表现出卓越性能。


Are Large Language Models Good at Detecting Propaganda?

Abstract

arXiv:2505.13706v1 Announce Type: cross Abstract: Propagandists use rhetorical devices that rely on logical fallacies and emotional appeals to advance their agendas. Recognizing these techniques is key to making informed decisions. Recent advances in Natural Language Processing (NLP) have enabled the development of systems capable of detecting manipulative content. In this study, we look at several Large Language Models and their performance in detecting propaganda techniques in news articles. We compare the performance of these LLMs with transformer-based models. We find that, while GPT-4 demonstrates superior F1 scores (F1=0.16) compared to GPT-3.5 and Claude 3 Opus, it does not outperform a RoBERTa-CRF baseline (F1=0.67). Additionally, we find that all three LLMs outperform a MultiGranularity Network (MGN) baseline in detecting instances of one out of six propaganda techniques (name-calling), with GPT-3.5 and GPT-4 also outperforming the MGN baseline in detecting instances of appeal to fear and flag-waving.

摘要

宣传者常利用基于逻辑谬误与情感诉求的修辞手段来推动其议程。识别这些技术是做出理性决策的关键。自然语言处理(NLP)领域的最新进展使得开发能够检测操纵性内容的系统成为可能。本研究考察了多种大型语言模型在新闻文章宣传技术检测中的表现,并将其与基于Transformer的模型进行性能对比。研究发现:尽管GPT-4的F1分数(F1=0.16)优于GPT-3.5和Claude 3 Opus,但未超过RoBERTa-CRF基线模型(F1=0.67);此外,在六种宣传技术之一的"污名化"检测中,三种大型语言模型均优于多粒度网络(MGN)基线,而GPT-3.5和GPT-4在"诉诸恐惧"与"旗帜挥舞"两种技术的检测中也超越了MGN基线。


SayCoNav: Utilizing Large Language Models for Adaptive Collaboration in Decentralized Multi-Robot Navigation

Abstract

arXiv:2505.13729v1 Announce Type: cross Abstract: Adaptive collaboration is critical for a team of autonomous robots to perform complicated navigation tasks in large-scale unknown environments. An effective collaboration strategy should be determined and adapted according to each robot's skills and current status to successfully achieve the shared goal. We present SayCoNav, a new approach that leverages large language models (LLMs) for automatically generating this collaboration strategy among a team of robots. Building on the collaboration strategy, each robot uses the LLM to generate its plans and actions in a decentralized way. By sharing information with each other during navigation, each robot also continuously updates its step-by-step plans accordingly. We evaluate SayCoNav on Multi-Object Navigation (MultiON) tasks, which require the team of robots to utilize their complementary strengths to efficiently search for multiple different objects in unknown environments. By validating SayCoNav with varied team compositions and conditions against baseline methods, our experimental results show that SayCoNav can improve search efficiency by up to 44.28% through effective collaboration among heterogeneous robots. It can also dynamically adapt to the changing conditions during task execution.

摘要

自适应协作对于自主机器人团队在大规模未知环境中执行复杂导航任务至关重要。有效的协作策略应根据每个机器人的技能和当前状态进行动态调整,以实现共同目标。我们提出SayCoNav方法,利用大语言模型(LLM)自动生成机器人团队间的协作策略。基于该策略,各机器人以去中心化方式通过LLM生成自身规划与行动。在导航过程中通过信息共享,每个机器人持续更新其分步计划。我们在多目标导航(MultiON)任务上评估SayCoNav,该任务要求机器人团队利用互补优势在未知环境中高效搜索多个不同目标。通过对比不同团队构成和条件下的基线方法,实验结果表明:SayCoNav通过异构机器人间的有效协作,最高可提升44.28%的搜索效率,并能在任务执行过程中动态适应环境变化。


Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training

Abstract

arXiv:2505.13738v1 Announce Type: cross Abstract: Efficient LLM pre-training requires well-tuned hyperparameters (HPs), including learning rate η and weight decay λ. We study scaling laws for HPs: formulas for how to scale HPs as we scale model size N, dataset size D, and batch size B. Recent work suggests the AdamW timescale, B/(ηλD), should remain constant across training settings, and we verify the implication that optimal λ scales linearly with B, for a fixed N, D. However, as N, D scale, we show the optimal timescale obeys a precise power law in the tokens-per-parameter ratio, D/N. This law thus provides a method to accurately predict λopt in advance of large-scale training. We also study scaling laws for optimal batch size Bopt (the B enabling lowest loss at a given N, D) and critical batch size Bcrit (the B beyond which further data parallelism becomes ineffective). In contrast with prior work, we find both Bopt and Bcrit scale as power laws in D, independent of model size, N. Finally, we analyze how these findings inform the real-world selection of Pareto-optimal N and D under dual training time and compute objectives.

摘要

高效的大型语言模型(LLM)预训练需要精心调整的超参数(HPs),包括学习率η和权重衰减λ。我们研究了超参数的缩放规律:即如何随着模型规模N、数据集规模D和批量大小B的缩放而调整超参数的公式。近期研究表明,AdamW时间尺度B/(ηλD)应在不同训练设置中保持恒定,我们验证了这一假设,即在固定N和D的情况下,最优λ与B呈线性比例关系。然而,随着N和D的缩放,我们发现最优时间尺度遵循令牌-参数比D/N的精确幂律关系。这一规律为在大规模训练前准确预测最优λopt提供了方法。我们还研究了最优批量大小Bopt(在给定N和D下实现最低损失的B)和临界批量大小Bcrit(超过此值后数据并行效果递减的B)的缩放规律。与先前研究不同,我们发现Bopt和Bcrit均表现为D的幂律函数,且与模型规模N无关。最后,我们分析了这些发现如何指导在实际场景中基于双重训练时间和计算目标选择帕累托最优的N和D。
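The linear relationship between optimal weight decay and batch size follows directly from holding the timescale B/(ηλD) fixed. A minimal sketch of that arithmetic, assuming B and D are measured in the same units; the function names and numeric values are illustrative and not taken from the paper.

```python
def adamw_timescale(batch_size, lr, weight_decay, dataset_size):
    """AdamW timescale B / (eta * lambda * D), as written in the abstract."""
    return batch_size / (lr * weight_decay * dataset_size)


def weight_decay_for_batch(target_timescale, batch_size, lr, dataset_size):
    """Solve tau = B / (eta * lambda * D) for lambda at a given batch size."""
    return batch_size / (lr * target_timescale * dataset_size)


# Illustrative values only (not from the paper).
lr, dataset_size = 1e-3, 1e9
tau = adamw_timescale(batch_size=256, lr=lr, weight_decay=0.1, dataset_size=dataset_size)

# Doubling B at fixed eta and D means doubling lambda to keep tau unchanged.
wd_at_512 = weight_decay_for_batch(tau, batch_size=512, lr=lr, dataset_size=dataset_size)
```

In this example the weight decay moves from 0.1 to 0.2 when the batch size doubles. The power-law dependence of the optimal timescale on D/N reported in the abstract is not modeled here, since the abstract gives no coefficients.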


RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs

Abstract

arXiv:2505.13697v1 Announce Type: cross Abstract: Reinforcement learning-based post-training of large language models (LLMs) has recently gained attention, particularly following the release of DeepSeek R1, which applied GRPO for fine-tuning. Amid the growing hype around improved reasoning abilities attributed to RL post-training, we critically examine the formulation and assumptions underlying these methods. We start by highlighting the popular structural assumptions made in modeling LLM training as a Markov Decision Process (MDP), and show how they lead to a degenerate MDP that doesn't quite need the RL/GRPO apparatus. The two critical structural assumptions include (1) making the MDP states be just a concatenation of the actions, with states becoming the context window and the actions becoming the tokens in LLMs, and (2) splitting the reward of a state-action trajectory uniformly across the trajectory. Through a comprehensive analysis, we demonstrate that these simplifying assumptions make the approach effectively equivalent to an outcome-driven supervised learning. Our experiments on benchmarks including GSM8K and Countdown using Qwen-2.5 base models show that iterative supervised fine-tuning, incorporating both positive and negative samples, achieves performance comparable to GRPO-based training. We will also argue that the structural assumptions indirectly incentivize the RL to generate longer sequences of intermediate tokens, which in turn feeds into the narrative of "RL generating longer thinking traces." While RL may well be a very useful technique for improving the reasoning abilities of LLMs, our analysis shows that the simplistic structural assumptions made in modeling the underlying MDP render the popular LLM RL frameworks and their interpretations questionable.

摘要

基于强化学习的大语言模型(LLM)后训练近期备受关注,尤其在DeepSeek R1应用GRPO进行微调后。当业界普遍将推理能力提升归因于RL后训练时,我们对此类方法的基础假设与建模框架进行了批判性审视。首先,我们指出将LLM训练建模为马尔可夫决策过程(MDP)时的常见结构假设,揭示这些假设会导致退化的MDP——其本质上并不需要RL/GRPO机制。两个关键结构假设包括:(1)将MDP状态简单定义为动作的串联(状态即上下文窗口,动作即LLM生成的标记);(2)将状态-动作轨迹的奖励均匀分配至整个轨迹。通过系统分析,我们证明这些简化假设使该方法实质上等同于结果驱动的监督学习。基于Qwen-2.5基础模型在GSM8K和Countdown基准上的实验表明,结合正负样本的迭代监督微调可获得与GRPO训练相当的性能。我们进一步论证:这些结构假设会间接激励RL生成更长的中间标记序列——这恰好契合"RL产生更长思维轨迹"的流行论述。尽管RL确实是提升LLM推理能力的有效技术,但我们的分析表明,现有LLM-RL框架对底层MDP的简化建模假设使其方法论基础与解释效力值得商榷。
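The claimed equivalence is easy to see numerically in a stripped-down setting. The sketch below is not the authors' analysis: it assumes a plain REINFORCE-style surrogate and ignores the group normalization, clipping, and KL terms used in GRPO. Under that simplification, spreading a trajectory reward uniformly over tokens yields the same objective as a reward-weighted log-likelihood loss. All names and values are illustrative.

```python
def uniform_token_rewards(trajectory_reward, num_tokens):
    """Structural assumption (2): split the trajectory reward evenly per token."""
    return [trajectory_reward / num_tokens] * num_tokens


def policy_gradient_surrogate(token_logprobs, token_rewards):
    """REINFORCE-style surrogate loss: -sum_t r_t * log pi(a_t | s_t)."""
    return -sum(r * lp for r, lp in zip(token_rewards, token_logprobs))


def reward_weighted_nll(token_logprobs, trajectory_reward):
    """Outcome-weighted supervised loss: -R * mean_t log pi(a_t | s_t)."""
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return -trajectory_reward * mean_logprob


# Log-probabilities of the sampled tokens and a single outcome reward.
logprobs = [-0.5, -1.2, -0.3, -2.0]
reward = 1.0

pg_loss = policy_gradient_surrogate(
    logprobs, uniform_token_rewards(reward, len(logprobs))
)
sft_loss = reward_weighted_nll(logprobs, reward)
```

Both losses come out identical, so their gradients with respect to the policy parameters coincide as well; a negative reward simply flips the sign, which corresponds to the use of negative samples mentioned in the abstract.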


Structured Agent Distillation for Large Language Model

Abstract

arXiv:2505.13820v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.

摘要

大型语言模型(LLMs)通过交替进行推理与行动(如ReAct式框架所示),展现出作为决策智能体的强大能力。然而,其实际部署受限于高推理成本与大模型体积。我们提出结构化智能体蒸馏框架,将基于大型LLM的智能体压缩为更小的学生模型,同时保持推理保真度与行动一致性。与标准词元级蒸馏不同,本方法将轨迹分割为[REASON](推理)与[ACT](行动)片段,通过片段特异性损失函数使各组件与教师模型行为对齐。这种结构感知的监督机制使紧凑型智能体能更好地复现教师决策过程。在ALFWorld、HotPotQA-ReAct和WebShop上的实验表明,该方法始终优于词元级蒸馏与模仿学习基线,在显著压缩模型的同时实现性能最小化下降。扩展性与消融实验结果进一步验证了片段级对齐对高效可部署智能体的重要性。
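The segment-specific losses described above might be organized along the following lines. This is a minimal sketch, not the paper's implementation: it assumes per-token distillation losses have already been computed, that each token carries a `REASON` or `ACT` tag, and that the two span losses are combined with hypothetical weights `w_reason` and `w_act`.

```python
def segment_losses(token_losses, span_tags):
    """Average the per-token loss separately for each span type."""
    sums, counts = {}, {}
    for loss, tag in zip(token_losses, span_tags):
        sums[tag] = sums.get(tag, 0.0) + loss
        counts[tag] = counts.get(tag, 0) + 1
    return {tag: sums[tag] / counts[tag] for tag in sums}


def structured_loss(token_losses, span_tags, w_reason=1.0, w_act=1.0):
    """Combine span-level losses; the weights are illustrative assumptions."""
    per_span = segment_losses(token_losses, span_tags)
    return (w_reason * per_span.get("REASON", 0.0)
            + w_act * per_span.get("ACT", 0.0))


# Toy trajectory: two reasoning tokens followed by two action tokens.
losses = [1.0, 3.0, 2.0, 4.0]
tags = ["REASON", "REASON", "ACT", "ACT"]
total = structured_loss(losses, tags, w_reason=1.0, w_act=0.5)
```

Averaging within each span before combining keeps a long reasoning span from drowning out a short action span, which a flat token-level average would not guarantee.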


Advancing Software Quality: A Standards-Focused Review of LLM-Based Assurance Techniques

Abstract

arXiv:2505.13766v1 Announce Type: cross Abstract: Software Quality Assurance (SQA) is critical for delivering reliable, secure, and efficient software products. The Software Quality Assurance Process aims to provide assurance that work products and processes comply with predefined provisions and plans. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance existing SQA processes by automating tasks like requirement analysis, code review, test generation, and compliance checks. Simultaneously, established standards such as ISO/IEC 12207, ISO/IEC 25010, ISO/IEC 5055, ISO 9001/ISO/IEC 90003, CMMI, and TMM provide structured frameworks for ensuring robust quality practices. This paper surveys the intersection of LLM-based SQA methods and these recognized standards, highlighting how AI-driven solutions can augment traditional approaches while maintaining compliance and process maturity. We first review the foundational software quality standards and the technical fundamentals of LLMs in software engineering. Next, we explore various LLM-based SQA applications, including requirement validation, defect detection, test generation, and documentation maintenance. We then map these applications to key software quality frameworks, illustrating how LLMs can address specific requirements and metrics within each standard. Empirical case studies and open-source initiatives demonstrate the practical viability of these methods. At the same time, discussions on challenges (e.g., data privacy, model bias, explainability) underscore the need for deliberate governance and auditing. Finally, we propose future directions encompassing adaptive learning, privacy-focused deployments, multimodal analysis, and evolving standards for AI-driven software quality.

摘要

软件质量保证(SQA)对于交付可靠、安全且高效的软件产品至关重要。软件质量保证流程旨在确保工作成果和过程符合预定义条款与计划。大型语言模型(LLM)的最新进展为增强现有SQA流程提供了新机遇,可自动化实现需求分析、代码审查、测试生成和合规性检查等任务。同时,ISO/IEC 12207、ISO/IEC 25010、ISO/IEC 5055、ISO 9001/ISO/IEC 90003、CMMI和TMM等成熟标准为稳健的质量实践提供了结构化框架。本文研究了基于LLM的SQA方法与这些公认标准的交叉领域,重点阐述AI驱动解决方案如何在保持合规性和过程成熟度的同时增强传统方法。我们首先回顾基础软件质量标准及LLM在软件工程中的技术原理,继而探讨包括需求验证、缺陷检测、测试生成和文档维护在内的多种基于LLM的SQA应用。随后将这些应用映射至关键软件质量框架,阐明LLM如何满足各标准中的特定要求与指标。实证案例研究和开源计划证明了这些方法的实际可行性,同时关于数据隐私、模型偏差和可解释性等挑战的讨论强调了审慎治理与审计的必要性。最后,我们提出涵盖自适应学习、隐私优先部署、多模态分析及AI驱动软件质量标准演进等未来研究方向。


Domain Gating Ensemble Networks for AI-Generated Text Detection

Abstract

arXiv:2505.13855v1 Announce Type: cross Abstract: As state-of-the-art language models continue to improve, the need for robust detection of machine-generated text becomes increasingly critical. However, current state-of-the-art machine text detectors struggle to adapt to new unseen domains and generative models. In this paper we present DoGEN (Domain Gating Ensemble Networks), a technique that allows detectors to adapt to unseen domains by ensembling a set of domain expert detector models using weights from a domain classifier. We test DoGEN on a wide variety of domains from leading benchmarks and find that it achieves state-of-the-art performance on in-domain detection while outperforming models twice its size on out-of-domain detection. We release our code and trained models to assist in future research in domain-adaptive AI detection.

摘要

随着最先进语言模型的持续进步,对机器生成文本进行鲁棒检测的需求变得愈发关键。然而,当前最先进的机器文本检测器难以适应新的未见领域和生成模型。本文提出DoGEN(领域门控集成网络),该技术通过利用领域分类器的权重集成一组领域专家检测模型,使检测器能够适应未见领域。我们在领先基准测试的多种领域上验证DoGEN,发现其在领域内检测中达到最先进性能,同时在跨领域检测上优于两倍规模的模型。我们公开了代码和训练模型,以助力未来领域自适应AI检测的研究。
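The gating idea can be sketched in a few lines. The code below is an illustration under assumptions, not the released DoGEN code: it supposes that a domain classifier emits one logit per domain, that each domain expert emits a probability that the text is machine-generated, and that the ensemble is a softmax-weighted average of those probabilities.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_ensemble(domain_logits, expert_scores):
    """Weight each expert's score by the classifier's belief in its domain."""
    if len(domain_logits) != len(expert_scores):
        raise ValueError("need exactly one expert score per domain")
    weights = softmax(domain_logits)
    return sum(w * s for w, s in zip(weights, expert_scores))


# Classifier is undecided between two domains, so the experts are averaged.
score = gated_ensemble([0.0, 0.0], [0.2, 0.8])
```

When the classifier is confident about one domain, the output approaches that expert's score; on an unfamiliar domain the weights spread out and the experts are blended.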


Preference Learning with Lie Detectors can Induce Honesty or Evasion

Abstract

arXiv:2505.13787v1 Announce Type: cross Abstract: As AI systems become more capable, deceptive behaviors can undermine evaluation and mislead users at deployment. Recent work has shown that lie detectors can accurately classify deceptive behavior, but they are not typically used in the training pipeline due to concerns around contamination and objective hacking. We examine these concerns by incorporating a lie detector into the labelling step of LLM post-training and evaluating whether the learned policy is genuinely more honest, or instead learns to fool the lie detector while remaining deceptive. Using DolusChat, a novel 65k-example dataset with paired truthful/deceptive responses, we identify three key factors that determine the honesty of learned policies: amount of exploration during preference learning, lie detector accuracy, and KL regularization strength. We find that preference learning with lie detectors and GRPO can lead to policies which evade lie detectors, with deception rates of over 85%. However, if the lie detector true positive rate (TPR) or KL regularization is sufficiently high, GRPO learns honest policies. In contrast, off-policy algorithms (DPO) consistently lead to deception rates under 25% for realistic TPRs. Our results illustrate a more complex picture than previously assumed: depending on the context, lie-detector-enhanced training can be a powerful tool for scalable oversight, or a counterproductive method encouraging undetectable misalignment.

摘要

随着AI系统能力不断提升,欺骗性行为可能破坏评估过程并在部署时误导用户。近期研究表明,谎言检测器能准确识别欺骗行为,但由于存在数据污染和目标破解的顾虑,这类检测器通常未被纳入训练流程。本研究通过将谎言检测器整合至大语言模型后训练的标注环节,评估习得策略是真正更诚实,还是仅学会规避检测器而保持欺骗性。基于DolusChat数据集(包含6.5万组真实/欺骗性回答配对的新颖数据集),我们确定了决定策略诚实度的三个关键因素:偏好学习中的探索量、谎言检测器准确率和KL正则化强度。研究发现,采用谎言检测器和GRPO算法的偏好学习可能导致策略学会规避检测器,欺骗率超过85%。然而当检测器真阳性率(TPR)或KL正则化足够高时,GRPO能习得诚实策略。相比之下,离策略算法(DPO)在实际TPR水平下始终将欺骗率控制在25%以内。研究结果揭示了比既往认知更复杂的图景:根据具体情境,基于谎言检测器的训练既可能成为可扩展监督的有效工具,也可能成为助长不可检测错位的适得其反之法。


Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

Abstract

arXiv:2505.13775v1 Announce Type: cross Abstract: Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. In this paper, we critically examine that interpretation by investigating how the semantics of intermediate tokens (often anthropomorphized as "thoughts" or reasoning traces, and claimed to display behaviors like backtracking, self-verification, etc.) actually influence model performance. We train transformer models on formally verifiable reasoning traces and solutions, constraining both intermediate steps and final outputs to align with those of a formal solver (in our case, A* search). By constructing a formal interpreter of the semantics of our problems and intended algorithm, we systematically evaluate not only solution accuracy but also the correctness of intermediate traces, thus allowing us to evaluate whether the latter causally influences the former. We notice that, despite significant improvements on the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions. To further show that trace accuracy is only loosely connected to solution accuracy, we then train models on noisy, corrupted traces which have no relation to the specific problem each is paired with, and find that not only does performance remain largely consistent with models trained on correct data, but in some cases can improve upon it and generalize more robustly on out-of-distribution tasks. These results challenge the assumption that intermediate tokens or "Chains of Thought" induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic behaviors in language models.

摘要

近期大型推理模型取得的显著成果常被解读为思维链(CoT)方法的胜利,尤其是通过基于大型语言模型生成的CoT样本进行训练以发现新推理模式的过程。本文通过研究中间标记的语义(这些标记常被拟人化为"思考"或推理轨迹,并声称表现出回溯、自我验证等行为)如何实际影响模型性能,对这一解读提出批判性检验。我们在形式可验证的推理轨迹和解决方案上训练Transformer模型,将中间步骤和最终输出与形式化求解器(本研究中为A*搜索)的结果对齐。通过构建问题语义及目标算法的形式化解释器,我们不仅系统评估解决方案准确性,还评估中间轨迹的正确性,从而判断后者是否对前者存在因果影响。研究发现:尽管相较仅关注解决方案的基线有显著改进,但基于完全正确轨迹训练的模型在得出正确解时仍会产生无效推理轨迹。为进一步证明轨迹准确性与解决方案准确性仅存在松散关联,我们随后在噪声干扰的污染轨迹上训练模型(这些轨迹与配对问题无实质关联),发现模型性能不仅与正确数据训练的模型基本一致,在某些情况下甚至表现更优,且在分布外任务上展现出更强泛化能力。这些结果挑战了"中间标记或思维链会诱导可预测推理行为"的假设,警示研究者应避免对这些输出进行拟人化解读,或过度将其(尽管形式基本正确)视为语言模型具有类人或算法化行为的证据。


Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

Abstract

arXiv:2505.13792v1 Announce Type: cross Abstract: Question Answering (QA) poses a challenging and critical problem, particularly in today's age of interactive dialogue systems such as ChatGPT, Perplexity, Microsoft Copilot, etc. where users demand both accuracy and transparency in the model's outputs. Since smaller language models (SLMs) are computationally more efficient but often under-perform compared to larger models, Knowledge Distillation (KD) methods allow for finetuning these smaller models to improve their final performance. Lately, the intermediate tokens or the so-called 'reasoning' traces produced by Chain-of-Thought (CoT) or by reasoning models such as DeepSeek R1 are used as a training signal for KD. However, these reasoning traces are often verbose and difficult to interpret or evaluate. In this work, we aim to address the challenge of evaluating the faithfulness of these reasoning traces and their correlation with the final performance. To this end, we employ a KD method leveraging rule-based problem decomposition. This approach allows us to break down complex queries into structured sub-problems, generating interpretable traces whose correctness can be readily evaluated, even at inference time. Specifically, we demonstrate this approach on Open Book QA, decomposing the problem into a Classification step and an Information Retrieval step, thereby simplifying trace evaluation. Our SFT experiments with correct and incorrect traces on the CoTemp QA, Microsoft Machine Reading Comprehension QA, and Facebook bAbI QA datasets reveal the striking finding that correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness. These results challenge the implicit assumption behind utilizing reasoning traces for improving SLMs' final performance via KD.

摘要

问答系统(QA)提出了一个具有挑战性且关键的问题,尤其在当今ChatGPT、Perplexity、Microsoft Copilot等交互式对话系统时代,用户对模型输出的准确性和透明度均有较高要求。由于小型语言模型(SLMs)计算效率更高,但性能通常不及大型模型,知识蒸馏(KD)方法可通过微调这些小型模型来提升其最终性能。近期,由思维链(CoT)或DeepSeek R1等推理模型产生的中间标记(即所谓"推理"轨迹)被用作KD的训练信号。然而,这些推理轨迹往往冗长且难以解释或评估。本研究旨在解决评估这些推理轨迹的忠实度及其与最终性能相关性的挑战。为此,我们采用了一种基于规则问题分解的KD方法,通过将复杂查询拆解为结构化子问题,生成可解释的轨迹——其正确性甚至在推理阶段也能被便捷评估。具体而言,我们在开放书籍QA任务中演示了该方法,将问题分解为分类步骤和信息检索步骤,从而简化轨迹评估。我们在CoTemp QA、Microsoft机器阅读理解QA和Facebook bAbI QA数据集上进行的监督微调实验(使用正确与错误轨迹)揭示了一个显著发现:正确轨迹并不必然意味着模型能输出正确的最终答案。同样地,我们发现最终答案正确性与中间轨迹正确性之间的相关性较低。这些结果对利用推理轨迹通过KD提升SLMs最终性能的隐含假设提出了挑战。


CLEVER: A Curated Benchmark for Formally Verified Code Generation

Abstract

arXiv:2505.13938v1 Announce Type: cross Abstract: We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (https://github.com/trishullab/clever) as well as HuggingFace (https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online (https://github.com/trishullab/clever-prover).
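To make the two-part task concrete, here is a toy Lean 4 example of a specification, an implementation, and a proof linking them. It is a hypothetical illustration and not a problem from the benchmark; the names `doubleSpec`, `double`, and `double_correct` are invented, and the proof assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- Hypothetical toy task (not from CLEVER): double a natural number.

-- Specification: a predicate relating an input to an acceptable output.
def doubleSpec (n result : Nat) : Prop :=
  result = 2 * n

-- Candidate implementation.
def double (n : Nat) : Nat :=
  n + n

-- Proof that the implementation satisfies the specification.
-- After unfolding, the goal is `n + n = 2 * n`, which is linear arithmetic.
theorem double_correct (n : Nat) : doubleSpec n (double n) := by
  unfold doubleSpec double
  omega
```

Note that the specification states what the result must equal without restating the implementation's `n + n`, echoing the abstract's concern about specifications that leak implementation logic.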


FlashThink: An Early Exit Method For Efficient Reasoning

Abstract

arXiv:2505.13949v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown impressive performance in reasoning tasks. However, LLMs tend to generate excessively long reasoning content, leading to significant computational overhead. Our observations indicate that even on simple problems, LLMs tend to produce unnecessarily lengthy reasoning content, which is against intuitive expectations. Preliminary experiments show that at a certain point during the generation process, the model is already capable of producing the correct solution without completing the full reasoning content. Therefore, we consider that the reasoning process of the model can be exited early to achieve the purpose of efficient reasoning. We introduce a verification model that identifies the exact moment when the model can stop reasoning and still provide the correct answer. Comprehensive experiments on four different benchmarks demonstrate that our proposed method, FlashThink, effectively shortens the reasoning content while preserving the model accuracy. For the Deepseek-R1 and QwQ-32B models, we reduced the length of reasoning content by 77.04% and 77.47%, respectively, without reducing the accuracy.

摘要

大型语言模型(LLMs)在推理任务中展现出卓越性能,但其生成的推理内容往往过于冗长,导致显著的计算开销。我们通过观察发现,即使在简单问题上,LLMs也会产生超出必要长度的推理内容,这与直观预期相悖。初步实验表明,在生成过程的特定节点,模型无需完成完整推理即可得出正确答案。因此,我们认为可通过提前终止模型推理过程来实现高效推理。本文提出一种验证模型,用于精准识别模型可停止推理但仍能提供正确答案的时机。在四个不同基准测试上的全面实验证明,我们提出的FlashThink方法能在保持模型精度的同时有效缩短推理内容。对于Deepseek-R1和QwQ-32B模型,我们分别将推理内容长度减少了77.04%和77.47%,且未降低准确率。
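The early-exit loop can be sketched as follows. This is a simplified illustration, not the FlashThink implementation: it assumes reasoning arrives in chunks and that a verifier is available as a callable `can_answer`, which here is a hypothetical stand-in (a string check) for the paper's verification model.

```python
def reason_with_early_exit(chunks, can_answer):
    """Consume reasoning chunks until the verifier says the answer is ready.

    Returns the chunks actually kept and whether an early exit happened.
    """
    kept = []
    for chunk in chunks:
        kept.append(chunk)
        if can_answer("".join(kept)):
            return kept, True
    return kept, False


# Toy reasoning trace; the last two chunks are redundant re-checking.
trace = [
    "Let 2x = 8. ",
    "So x = 4. ",
    "Wait, let me double-check. ",
    "Yes, x = 4.",
]


def toy_verifier(text):
    # Hypothetical stand-in for a learned verifier.
    return "x = 4" in text


kept, exited_early = reason_with_early_exit(trace, toy_verifier)
saved_fraction = 1 - len(kept) / len(trace)
```

In this toy trace the loop stops after the second chunk, so half of the chunks are never generated. Real savings depend on how reliably the verifier detects that the partial reasoning suffices.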


EfficientLLM: Efficiency in Large Language Models

Abstract

arXiv:2505.13840v1 Announce Type: cross Abstract: Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.

摘要

大型语言模型(LLMs)虽推动显著进展,但其激增的参数规模与上下文窗口导致计算、能耗及经济成本难以承受。本文提出EfficientLLM——首个系统性评估LLM效率技术的大规模基准研究,基于生产级集群(48×GH200,8×H200 GPU)展开实证分析。研究围绕三大核心维度:(1)架构预训练(高效注意力变体:MQA、GQA、MLA、NSA;稀疏混合专家系统MoE),(2)微调(参数高效方法:LoRA、RSLoRA、DoRA),(3)推理(量化方法:int4、float16)。通过定义六项细粒度指标(内存利用率、计算利用率、延迟、吞吐量、能耗、压缩率)量化硬件饱和度、延迟-吞吐平衡与碳成本。基于超100组模型-技术组合(0.5B-72B参数)的评估,得出三项核心结论:(i)效率存在可量化权衡:无普适最优方案,例如MoE降低FLOPs并提升精度但增加40%显存,而int4量化以3-5%精度代价实现3.9倍内存/能耗降低;(ii)最优解依赖任务与规模:MQA在资源受限设备上实现最佳内存-延迟权衡,MLA在质量敏感任务中困惑度最低,RSLoRA仅在14B参数以上超越LoRA效率;(iii)技术具备跨模态泛化性:扩展评估至大型视觉模型(Stable Diffusion 3.5、Wan 2.1)与视觉语言模型(Qwen2.5-VL),验证技术可迁移性。通过开源数据集、评估流程与排行榜,EfficientLLM为下一代基础模型的效率-性能权衡研究提供关键指导。
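One reason attention variants such as MQA and GQA trade memory against quality is the size of the key-value cache, which scales with the number of key-value heads. The back-of-the-envelope sketch below uses the standard cache-size formula with a hypothetical model configuration; the numbers are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the KV cache: keys and values for every layer, head, position."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical 32-layer model with head_dim 128 and a 4096-token context,
# stored in float16 (2 bytes per value).
mha = kv_cache_bytes(32, 32, 128, 4096)  # multi-head: 32 KV heads
gqa = kv_cache_bytes(32, 8, 128, 4096)   # grouped-query: 8 KV heads
mqa = kv_cache_bytes(32, 1, 128, 4096)   # multi-query: 1 KV head

GIB = 1024 ** 3
sizes_gib = {"MHA": mha / GIB, "GQA": gqa / GIB, "MQA": mqa / GIB}
```

Under these assumptions the cache shrinks from 2 GiB (MHA) to 0.5 GiB (GQA) to 64 MiB (MQA) per sequence, consistent with the abstract's observation that MQA suits memory-constrained devices.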


Do Language Models Use Their Depth Efficiently?

Abstract

arXiv:2505.13898v1 Announce Type: cross Abstract: Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1 and Qwen 3 family of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.

摘要

现代大型语言模型(LLM)的深度不断增加,且深度与性能呈正相关,尽管存在收益递减现象。然而,这些模型是否有效利用了其深度?它们是通过组合更多特征来创建浅层模型无法实现的高阶计算,还是仅仅将同类计算分散到更多层中?为探究这些问题,我们分析了Llama 3.1和Qwen 3系列模型的残差流。研究发现:首先,通过比较子层输出与残差流可见,后半部分网络层的贡献显著低于前半部分,且两部分之间存在明显的相位转变;其次,跳过后半部分网络层对未来计算和输出预测的影响远小于前半部分;第三,在多跳任务中,我们未能发现模型利用增加深度来组合多跳示例中子结果的证据;第四,我们尝试直接验证深层模型是否利用额外层执行新型计算。通过训练从浅层模型残差流到深层模型的线性映射,发现具有相同相对深度的网络层映射效果最佳,这表明更大模型仅是将同类计算分散至更多层中。所有证据表明,深层模型并未利用深度学习新型计算,而仅通过更大深度对残差进行更细粒度调整。这一发现可能有助于解释为何增加规模会导致堆叠Transformer架构的收益递减。
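The first finding compares each sublayer's output to the residual stream it writes into. One simple way to quantify this, sketched below, is the ratio of the output's L2 norm to the norm of the residual before the update. This metric is an assumption made for illustration; the paper's exact measure may differ.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def relative_contributions(residual_in, sublayer_outputs):
    """For each layer, return ||output|| / ||residual before the update||."""
    residual = list(residual_in)
    ratios = []
    for output in sublayer_outputs:
        ratios.append(l2_norm(output) / l2_norm(residual))
        # Residual connection: the sublayer output is added to the stream.
        residual = [r + o for r, o in zip(residual, output)]
    return ratios


# Toy two-layer example: the second layer writes a much smaller update.
ratios = relative_contributions([3.0, 4.0], [[3.0, 4.0], [0.6, 0.8]])
```

In the toy example the first layer's update is as large as the stream itself (ratio 1.0) while the second is about a tenth of it, the kind of drop the abstract reports for layers in the second half of the network.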


When LLMs meet open-world graph learning: a new perspective for unlabeled data uncertainty

Abstract

arXiv:2505.13989v1 Announce Type: cross Abstract: Recently, large language models (LLMs) have significantly advanced text-attributed graph (TAG) learning. However, existing methods inadequately handle data uncertainty in open-world scenarios, especially concerning limited labeling and unknown-class nodes. Prior solutions typically rely on isolated semantic or structural approaches for unknown-class rejection, lacking effective annotation pipelines. To address these limitations, we propose Open-world Graph Assistant (OGA), an LLM-based framework that combines adaptive label traceability, which integrates semantics and topology for unknown-class rejection, and a graph label annotator to enable model updates using newly annotated nodes. Comprehensive experiments demonstrate OGA's effectiveness and practicality.

摘要

近期,大语言模型(LLMs)在文本属性图(TAG)学习领域取得了显著进展。然而,现有方法难以有效应对开放世界场景中的数据不确定性,特别是在有限标注和未知类别节点方面。先前解决方案通常依赖孤立的语义或结构方法进行未知类别拒识,缺乏有效的标注流程。为解决这些局限性,我们提出开放世界图助手(OGA),这是一个基于LLM的框架,结合了自适应标签追溯机制(通过融合语义与拓扑结构实现未知类别拒识)和图标签标注器(利用新标注节点实现模型更新)。综合实验验证了OGA的有效性与实用性。


APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight

Abstract

arXiv:2505.13921v1 Announce Type: cross Abstract: Large Language Models (LLMs) demonstrate strong reasoning and task planning capabilities but remain fundamentally limited in physical interaction modeling. Existing approaches integrate perception via Vision-Language Models (VLMs) or adaptive decision-making through Reinforcement Learning (RL), but they fail to capture dynamic object interactions or require task-specific training, limiting their real-world applicability. We introduce APEX (Anticipatory Physics-Enhanced Execution), a framework that equips LLMs with physics-driven foresight for real-time task planning. APEX constructs structured graphs to identify and model the most relevant dynamic interactions in the environment, providing LLMs with explicit physical state updates. Simultaneously, APEX provides low-latency forward simulations of physically feasible actions, allowing LLMs to select optimal strategies based on predictive outcomes rather than static observations. We evaluate APEX on three benchmarks designed to assess perception, prediction, and decision-making: (1) Physics Reasoning Benchmark, testing causal inference and object motion prediction; (2) Tetris, evaluating whether physics-informed prediction enhances decision-making performance in long-horizon planning tasks; (3) Dynamic Obstacle Avoidance, assessing the immediate integration of perception and action feasibility analysis. APEX significantly outperforms standard LLMs and VLM-based models, demonstrating the necessity of explicit physics reasoning for bridging the gap between language-based intelligence and real-world task execution. The source code and experiment setup are publicly available at https://github.com/hwj20/APEX_EXP .

摘要

大语言模型(LLMs)展现出强大的推理与任务规划能力,但在物理交互建模方面仍存在根本性局限。现有方法通过视觉语言模型(VLMs)整合感知能力,或借助强化学习(RL)实现自适应决策,但这些方案要么无法捕捉动态物体交互,要么需要针对特定任务进行训练,限制了其现实适用性。我们提出APEX(Anticipatory Physics-Enhanced Execution)框架,该框架通过物理驱动的预见性赋能LLMs进行实时任务规划。APEX构建结构化图来识别并建模环境中最相关的动态交互,为LLMs提供显式的物理状态更新。同时,APEX提供低延迟的物理可行动作前向模拟,使LLMs能基于预测结果(而非静态观测)选择最优策略。我们在三个基准测试中评估APEX:(1)物理推理基准,测试因果推理与物体运动预测;(2)俄罗斯方块,评估物理信息预测是否能提升长时程规划任务的决策性能;(3)动态避障,检验感知与动作可行性分析的即时整合。APEX显著优于标准LLMs和基于VLM的模型,证明了显式物理推理对于弥合语言智能与现实任务执行间差距的必要性。源代码与实验设置公开于https://github.com/hwj20/APEX_EXP。


Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models

Abstract

arXiv:2505.13973v1 Announce Type: cross Abstract: Recently, reinforcement learning (RL)-based tuning has shifted the trajectory of Multimodal Large Language Models (MLLMs), particularly following the introduction of Group Relative Policy Optimization (GRPO). However, directly applying it to medical tasks remains challenging for achieving clinically grounded model behavior. Motivated by the need to align model response with clinical expectations, we investigate four critical dimensions that affect the effectiveness of RL-based tuning in medical visual question answering (VQA): base model initialization strategy, the role of medical semantic alignment, the impact of length-based rewards on long-chain reasoning, and the influence of bias. We conduct extensive experiments to analyze these factors for medical MLLMs, providing new insights into how models are domain-specifically fine-tuned. Additionally, our results also demonstrate that GRPO-based RL tuning consistently outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality.

摘要

近期,基于强化学习(RL)的调优方法改变了多模态大语言模型(MLLMs)的发展轨迹,尤其是在引入群体相对策略优化(GRPO)之后。然而,将其直接应用于医疗任务仍难以实现符合临床要求的模型行为。出于使模型响应与临床预期保持一致的需求,我们研究了影响基于RL的调优在医疗视觉问答(VQA)中有效性的四个关键维度:基础模型初始化策略、医学语义对齐的作用、基于长度的奖励对长链推理的影响,以及偏差的影响。我们通过大量实验分析这些因素对医疗MLLMs的作用,为领域特异性微调提供了新见解。此外,实验结果还表明,基于GRPO的RL调优在准确性和推理质量上均持续优于标准监督微调(SFT)。


Memory-Centric Embodied Question Answer

Abstract

arXiv:2505.13948v1 Announce Type: cross Abstract: Embodied Question Answering (EQA) requires agents to autonomously explore and understand the environment to answer context-dependent questions. Existing frameworks typically center around the planner, which guides the stopping module, memory module, and answering module for reasoning. In this paper, we propose a memory-centric EQA framework named MemoryEQA. Unlike planner-centric EQA models where the memory module cannot fully interact with other modules, MemoryEQA flexibly feeds memory information into all modules, thereby enhancing efficiency and accuracy in handling complex tasks, such as those involving multiple targets across different regions. Specifically, we establish a multi-modal hierarchical memory mechanism, which is divided into global memory that stores language-enhanced scene maps, and local memory that retains historical observations and state information. When performing EQA tasks, the multi-modal large language model is leveraged to convert memory information into the required input formats for injection into different modules. To evaluate EQA models' memory capabilities, we constructed the MT-HM3D dataset based on HM3D, comprising 1,587 question-answer pairs involving multiple targets across various regions, which requires agents to maintain memory of exploration-acquired target information. Experimental results on HM-EQA, MT-HM3D, and OpenEQA demonstrate the effectiveness of our framework, where a 19.8% performance gain on MT-HM3D compared to the baseline model further underscores memory capability's pivotal role in resolving complex tasks.

摘要

具身问答(EQA)要求智能体通过自主探索和理解环境来回答与上下文相关的问题。现有框架通常以规划器为核心,由其引导停止模块、记忆模块和回答模块进行推理。本文提出一种以记忆为中心的EQA框架MemoryEQA。与规划器主导的EQA模型不同(其记忆模块无法与其他模块充分交互),MemoryEQA能灵活地将记忆信息馈送至所有模块,从而提升处理复杂任务(如跨区域多目标场景)的效率和准确性。具体而言,我们建立了多模态分层记忆机制:全局记忆存储语言增强的场景地图,局部记忆保留历史观测与状态信息。执行EQA任务时,利用多模态大语言模型将记忆信息转换为所需输入格式并注入不同模块。为评估EQA模型的记忆能力,我们在HM3D基础上构建了MT-HM3D数据集,包含1,587个涉及跨区域多目标的问答对,要求智能体持续保持对探索所获目标信息的记忆。在HM-EQA、MT-HM3D和OpenEQA上的实验结果表明,本框架在MT-HM3D上相较基线模型19.8%的性能提升,进一步印证了记忆能力在解决复杂任务中的关键作用。


CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring

Abstract

arXiv:2505.13965v1 Announce Type: cross Abstract: Automated Essay Scoring (AES) is crucial for modern education, particularly with the increasing prevalence of multimodal assessments. However, traditional AES methods struggle with evaluation generalizability and multimodal perception, while even recent Multimodal Large Language Model (MLLM)-based approaches can produce hallucinated justifications and scores misaligned with human judgment. To address the limitations, we introduce CAFES, the first collaborative multi-agent framework specifically designed for AES. It orchestrates three specialized agents: an Initial Scorer for rapid, trait-specific evaluations; a Feedback Pool Manager to aggregate detailed, evidence-grounded strengths; and a Reflective Scorer that iteratively refines scores based on this feedback to enhance human alignment. Extensive experiments, using state-of-the-art MLLMs, achieve an average relative improvement of 21% in Quadratic Weighted Kappa (QWK) against ground truth, especially for grammatical and lexical diversity. Our proposed CAFES framework paves the way for an intelligent multimodal AES system. The code will be available upon acceptance.

摘要

自动作文评分(AES)在现代教育中至关重要,尤其是随着多模态评估日益普及。然而,传统AES方法在评估泛化性和多模态感知方面存在不足,而近期基于多模态大语言模型(MLLM)的方法也可能产生与人类判断不符的幻觉式评分理由。为应对这些局限,我们提出CAFES——首个专为AES设计的协作多智能体框架。该框架协调三个专业智能体:初始评分器进行快速、特质化评估;反馈池管理器聚合基于证据的详细优势;反思评分器根据反馈迭代优化分数以提升人类对齐性。采用最先进MLLM的广泛实验表明,该框架在二次加权卡帕系数(QWK)上较基准平均相对提升21%,尤其在语法和词汇多样性方面表现突出。CAFES框架为智能多模态AES系统的发展开辟了新路径。代码将在论文录用后公开。
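Quadratic Weighted Kappa, the agreement metric reported above, penalizes disagreements by the squared distance between the two ratings. The sketch below follows the standard definition, kappa = 1 - sum(w * O) / sum(w * E), where O is the observed rating matrix and E the matrix expected from the marginals. It is a reference-style illustration, not the paper's evaluation code, and it assumes integer ratings, at least two rating categories, and raters who do not all give one identical score.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Agreement between two integer rating lists, 1.0 meaning perfect."""
    n_cat = max_rating - min_rating + 1
    n = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms used to build the expected matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * hist_a[i] * hist_b[j] / n
    return 1.0 - numerator / denominator


# Identical scores give full agreement; reversed scores give the opposite.
perfect = quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3)
reversed_scores = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
```

Because the weights grow quadratically, a model that is off by two score points is penalized four times as heavily as one that is off by a single point.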


EEG-to-Text Translation: A Model for Deciphering Human Brain Activity

Abstract

arXiv:2505.13936v1 Announce Type: cross Abstract: With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 score of 38.00% (P), which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with a F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). In terms of CER, R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at https://github.com/Mmurrad/EEG-To-text.

摘要

随着Gemini、GPT等大型语言模型的快速发展,弥合人脑与语言处理之间的差距已成为重要研究领域。为应对这一挑战,研究者已开发出多种将脑电信号解码为文本的模型,但这些模型仍存在显著的性能局限。为突破现有缺陷,我们提出新型模型R1 Translator,旨在提升脑电到文本的解码性能。该模型将双向LSTM编码器与预训练的基于Transformer的解码器相结合,利用脑电特征生成高质量文本输出。模型通过LSTM处理脑电嵌入以捕捉序列依赖性,继而输入Transformer解码器进行有效文本生成。R1 Translator在ROUGE指标上表现优异,显著超越T5(先前研究)和Brain Translator模型:其ROUGE-1精确率达38.00%,较T5(34.89%)提升达9%,较Brain(35.69%)提高3%;ROUGE-L的F1值为32.51%,分别领先T5(29.67%)3%和Brain(30.38%)2%。在字符错误率(CER)方面,R1达到0.5795,较T5(0.5917)降低2%,较Brain(0.6001)下降4%;词错误率(WER)指标为0.7280,较T5(0.7610)提升4.3%,较Brain(0.7553)改进3.6%。
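The CER and WER figures quoted above are edit-distance metrics. As a reminder of how such numbers are usually computed, the sketch below implements the common definitions (Levenshtein distance divided by the reference length, over characters for CER and over whitespace-separated words for WER). It is an assumption that the paper follows these definitions, and the example strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Invented example: one wrong word out of three.
word_error = wer("the cat sat", "the cat sit")
char_error = cer("the cat sat", "the cat sit")
```

Lower is better for both metrics, which is why the abstract reports reductions for CER and WER but increases for the ROUGE scores.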


Improved Methods for Model Pruning and Knowledge Distillation

Abstract

arXiv:2505.14052v1 Announce Type: cross Abstract: Model pruning is a performance optimization technique for large language models like R1 or o3-mini. However, existing pruning methods often lead to significant performance degradation or require extensive retraining and fine-tuning. This technique aims to identify and remove neurons and connections that are unlikely to contribute during the human-computer interaction phase. Our goal is to obtain a much smaller and faster knowledge-distilled model that can quickly generate content almost as good as that of the unpruned model. We propose MAMA Pruning, short for Movement and Magnitude Analysis, an improved pruning method that effectively reduces model size and computational complexity while maintaining performance comparable to the original unpruned model even at extreme pruning levels. The improved method uses as its novel pruning indicators the weights and biases fixed in the pre-training phase and the GRPO rewards verified during the post-training phase. Preliminary experimental results show that our method outperforms or is comparable to state-of-the-art methods across various pruning levels and different downstream computational linguistics tasks.

摘要

模型剪枝是针对R1或o3-mini等大型语言模型的性能优化技术。然而现有剪枝方法往往导致显著性能下降或需要大量重训练与微调。该技术旨在识别并移除人机交互阶段中贡献可能性较低的神经元连接,目标是获得更小、更快的知识蒸馏模型,使其能快速生成与未剪枝模型质量相近的内容。我们提出MAMA剪枝法(移动与幅度分析缩写),这种改进的剪枝方法能有效减小模型规模与计算复杂度,同时在极端剪枝水平下仍保持与原始未剪枝模型相当的性能。改进方法基于预训练阶段固定的权重、偏置,以及训练后阶段经GRPO奖励验证的新型剪枝指标。初步实验结果表明,本方法在不同剪枝水平和下游计算语言学任务中均优于或媲美现有最先进方法。
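To make the "magnitude" half of the idea concrete, here is a minimal sketch of plain magnitude pruning on a flat list of weights. It is not MAMA Pruning: the movement analysis and the GRPO-reward indicators described in the abstract are not reproduced, and `magnitude_prune` and its `sparsity` parameter are hypothetical names chosen for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Generic magnitude-pruning baseline, not the MAMA method itself.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


layer = [0.5, -0.1, 0.05, -2.0]
half_pruned = magnitude_prune(layer, sparsity=0.5)
```

In practice the same selection is usually applied per layer or globally over tensors rather than over a Python list; the list form only keeps the sketch self-contained.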


From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Abstract

arXiv:2505.14045v1 Announce Type: cross Abstract: Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.

摘要

在大规模多语言数据上进行持续预训练和指令微调已被证明能有效将大语言模型(LLMs)扩展至低资源语言。然而,此类数据的非对齐特性限制了其有效捕捉跨语言语义的能力。相比之下,多向平行数据(即相同内容在多种语言间对齐)能提供更强的跨语言一致性,并为提升多语言性能带来更大潜力。本文基于TED演讲构建了一个大规模高质量多向平行语料库TED2025,涵盖113种语言,最多支持50种语言并行对齐,确保广泛的多语言覆盖。利用该数据集,我们探索了利用多向平行数据增强LLMs的最佳实践,包括持续预训练策略、指令微调方法以及关键影响因素分析。在六个多语言基准测试上的实验表明,基于多向平行数据训练的模型始终优于非对齐多语言数据训练的模型。


Social Sycophancy: A Broader Understanding of LLM Sycophancy

Abstract

arXiv:2505.13995v1 Announce Type: cross Abstract: A serious risk to the safety and utility of LLMs is sycophancy, i.e., excessive agreement with and flattery of the user. Yet existing work focuses on only one aspect of sycophancy: agreement with users' explicitly stated beliefs that can be compared to a ground truth. This overlooks forms of sycophancy that arise in ambiguous contexts such as advice and support-seeking, where there is no clear ground truth, yet sycophancy can reinforce harmful implicit assumptions, beliefs, or actions. To address this gap, we introduce a richer theory of social sycophancy in LLMs, characterizing sycophancy as the excessive preservation of a user's face (the positive self-image a person seeks to maintain in an interaction). We present ELEPHANT, a framework for evaluating social sycophancy across five face-preserving behaviors (emotional validation, moral endorsement, indirect language, indirect action, and accepting framing) on two datasets: open-ended questions (OEQ) and Reddit's r/AmITheAsshole (AITA). Across eight models, we show that LLMs consistently exhibit high rates of social sycophancy: on OEQ, they preserve face 47% more than humans, and on AITA, they affirm behavior deemed inappropriate by crowdsourced human judgments in 42% of cases. We further show that social sycophancy is rewarded in preference datasets and is not easily mitigated. Our work provides theoretical grounding and empirical tools (datasets and code) for understanding and addressing this under-recognized but consequential issue.

摘要

大型语言模型(LLMs)安全性与实用性的重大风险在于谄媚行为,即过度认同和恭维用户。然而现有研究仅关注谄媚的一个方面:对用户可验证的明确声明的认同。这忽略了在建议和寻求支持等模糊情境中出现的谄媚形式——虽无明确事实依据,却可能强化有害的隐性假设、信念或行为。为填补这一空白,我们提出了LLM社会谄媚的深化理论,将其定义为对用户"面子"(人际互动中个体试图维持的积极自我形象)的过度维护。我们开发了ELEPHANT评估框架,通过在开放式问题(OEQ)和Reddit的r/AmITheAsshole(AITA)两个数据集上分析五种保面子行为(情感验证、道德认可、间接语言、间接行动和接受框架),发现八种模型均持续表现出高社会谄媚率:在OEQ中保面子行为比人类多47%,在AITA中42%的情况下会肯定被众包人类判定为不当的行为。研究进一步表明,社会谄媚在偏好数据集中受到奖励且难以缓解。本工作为理解并解决这一未被充分认识但影响深远的问题提供了理论基础和实证工具(数据集与代码)。


A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

Abstract

arXiv:2505.14106v1 Announce Type: cross Abstract: We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.

摘要

我们提出PersonaConvBench,这是一个用于评估大型语言模型(LLMs)在多轮对话中个性化推理与生成能力的大规模基准测试平台。与现有仅关注个性化或对话结构的研究不同,PersonaConvBench将两者有机结合,提供三大核心任务:句子分类、影响回归以及跨十个基于Reddit的多样化领域的用户中心文本生成。该设计能系统分析个性化对话语境如何在真实多用户场景中塑造LLM的输出。我们在统一提示设置下对多个商业和开源LLM进行基准测试,发现融入个性化历史记录能带来显著性能提升——在情感分类任务中相较最佳非对话基线的相对增益高达198%。通过发布包含评估方案和代码的PersonaConvBench,我们旨在支持以下研究方向:适应个体风格、追踪长期语境、并生成语境丰富且引人入胜响应的LLM技术。


Field Matters: A lightweight LLM-enhanced Method for CTR Prediction

Abstract

arXiv:2505.14057v1 Announce Type: cross Abstract: Click-through rate (CTR) prediction is a fundamental task in modern recommender systems. In recent years, the integration of large language models (LLMs) has been shown to effectively enhance the performance of traditional CTR methods. However, existing LLM-enhanced methods often require extensive processing of detailed textual descriptions for large-scale instances or user/item entities, leading to substantial computational overhead. To address this challenge, this work introduces LLaCTR, a novel and lightweight LLM-enhanced CTR method that employs a field-level enhancement paradigm. Specifically, LLaCTR first utilizes LLMs to distill crucial and lightweight semantic knowledge from small-scale feature fields through self-supervised field-feature fine-tuning. Subsequently, it leverages this field-level semantic knowledge to enhance both feature representation and feature interactions. In our experiments, we integrate LLaCTR with six representative CTR models across four datasets, demonstrating its superior performance in terms of both effectiveness and efficiency compared to existing LLM-enhanced methods. Our code is available at https://anonymous.4open.science/r/LLaCTR-EC46.

摘要

点击率(CTR)预测是现代推荐系统中的核心任务。近年来,大型语言模型(LLM)的整合已被证明能有效提升传统CTR方法的性能。然而,现有的LLM增强方法通常需要对大规模实例或用户/物品实体的详细文本描述进行大量处理,导致显著的计算开销。为解决这一问题,本研究提出了LLaCTR,一种新颖且轻量级的LLM增强CTR方法,采用字段级增强范式。具体而言,LLaCTR首先利用LLM通过自监督的字段-特征微调,从小规模特征字段中提炼关键且轻量级的语义知识;随后,该方法利用这种字段级语义知识来增强特征表示和特征交互。实验中,我们将LLaCTR与四种数据集上的六种代表性CTR模型相结合,结果表明其相较于现有LLM增强方法在效果和效率方面均具有优越性。代码发布于https://anonymous.4open.science/r/LLaCTR-EC46。


AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models

Abstract

arXiv:2505.14103v1 Announce Type: cross Abstract: Jailbreak attacks on large audio-language models (LALMs) have been studied recently, but they achieve suboptimal effectiveness, applicability, and practicability, in particular because they assume that the adversary can fully manipulate user prompts. In this work, we first conduct an extensive experiment showing that advanced text jailbreak attacks cannot be easily ported to end-to-end LALMs via text-to-speech (TTS) techniques. We then propose AudioJailbreak, a novel audio jailbreak attack, featuring (1) asynchrony: the jailbreak audio does not need to align with user prompts in the time axis, which is achieved by crafting suffixal jailbreak audios; (2) universality: a single jailbreak perturbation is effective for different prompts, which is achieved by incorporating multiple prompts into perturbation generation; (3) stealthiness: the malicious intent of jailbreak audios does not raise the awareness of victims, which is achieved by various intent concealment strategies; and (4) over-the-air robustness: the jailbreak audios remain effective when played over the air, which is achieved by incorporating the reverberation distortion effect with room impulse response into the generation of the perturbations. In contrast, none of the prior audio jailbreak attacks offers asynchrony, universality, stealthiness, or over-the-air robustness. Moreover, AudioJailbreak is also applicable to an adversary who cannot fully manipulate user prompts, and thus covers a much broader range of attack scenarios. Extensive experiments on the largest set of LALMs evaluated thus far demonstrate the high effectiveness of AudioJailbreak. We highlight that our work peeks into the security implications of audio jailbreak attacks against LALMs, and realistically fosters improvements to their security robustness. The implementation and audio samples are available at our website https://audiojailbreak.github.io/AudioJailbreak.

摘要

大型音频语言模型(LALMs)的越狱攻击近期受到研究,但现有方法在有效性、适用性和实用性方面存在不足,尤其是假设攻击者能完全操控用户输入。本研究首先通过大量实验证明:先进的文本越狱攻击无法通过文本转语音(TTS)技术直接迁移至端到端LALMs。继而提出AudioJailbreak新型音频越狱攻击,其特点包括:(1)异步性:通过构造后缀型越狱音频,无需在时间轴上与用户输入对齐;(2)普适性:通过将多组输入融入扰动生成,单个扰动即可适配不同输入;(3)隐蔽性:采用多样化意图隐藏策略,避免触发受害者警觉;(4)空传鲁棒性:通过结合房间脉冲响应的混响失真效应生成扰动,确保音频在空传后仍有效。相较之下,现有音频越狱攻击均无法实现异步性、普适性、隐蔽性或空传鲁棒性。此外,AudioJailbreak对无法完全操控用户输入的攻击者同样适用,显著拓宽了攻击场景。基于迄今最全面的LALMs实验表明该方法具有高效性。本研究揭示了音频越狱攻击对LALMs的安全威胁,切实推动其安全鲁棒性提升。


Gender Trouble in Language Models: An Empirical Audit Guided by Gender Performativity Theory

Abstract

arXiv:2505.14080v1 Announce Type: cross Abstract: Language models encode and subsequently perpetuate harmful gendered stereotypes. Research has succeeded in mitigating some of these harms, e.g. by dissociating non-gendered terms such as occupations from gendered terms such as 'woman' and 'man'. This approach, however, remains superficial given that associations are only one form of prejudice through which gendered harms arise. Critical scholarship on gender, such as gender performativity theory, emphasizes how harms often arise from the construction of gender itself, such as conflating gender with biological sex. In language models, these issues could lead to the erasure of transgender and gender diverse identities and cause harms in downstream applications, from misgendering users to misdiagnosing patients based on wrong assumptions about their anatomy. For FAccT research on gendered harms to go beyond superficial linguistic associations, we advocate for a broader definition of 'gender bias' in language models. We operationalize insights on the construction of gender through language from gender studies literature and then empirically test how 16 language models of different architectures, training datasets, and model sizes encode gender. We find that language models tend to encode gender as a binary category tied to biological sex, and that gendered terms that do not neatly fall into one of these binary categories are erased and pathologized. Finally, we show that larger models, which achieve better results on performance benchmarks, learn stronger associations between gender and sex, further reinforcing a narrow understanding of gender. Our findings lead us to call for a re-evaluation of how gendered harms in language models are defined and addressed.

摘要

语言模型编码并持续强化有害的性别刻板印象。现有研究已成功缓解部分此类危害,例如将"职业"等非性别化术语与"女人""男人"等性别化术语解耦。然而这种方法仍流于表面,因为关联仅是产生性别偏见的其中一种形式。性别表演性理论等批判性性别研究强调,危害往往源于性别本身的建构过程,例如将性别与生理性别混为一谈。在语言模型中,这些问题可能导致跨性别与多元性别身份的抹除,并在下游应用中造成伤害——从对用户的错误性别指认,到基于错误解剖假设的误诊。

为使FAccT关于性别危害的研究超越浅层语言关联,我们主张对语言模型中"性别偏见"采用更广义的定义。通过将性别研究文献中关于语言建构性别的研究成果操作化,我们实证检验了16种不同架构、训练数据集和模型规模的语言模型如何编码性别。研究发现:语言模型倾向于将性别编码为与生理性别绑定的二元类别,而无法明确归入此类二元体系的性别术语则被抹除或病理化。最后我们证明,在性能基准测试中表现更优的大规模模型,会习得更强的性别与生理性别关联,进一步强化对性别的狭隘认知。这些发现促使我们呼吁重新评估语言模型中性别危害的定义与应对方式。


MAS-KCL: Knowledge component graph structure learning with large language model-based agentic workflow

Abstract

arXiv:2505.14126v1 Announce Type: cross Abstract: Knowledge components (KCs) are the fundamental units of knowledge in the field of education. A KC graph illustrates the relationships and dependencies between KCs. An accurate KC graph can assist educators in identifying the root causes of learners' poor performance on specific KCs, thereby enabling targeted instructional interventions. To achieve this, we have developed a KC graph structure learning algorithm, named MAS-KCL, which employs a multi-agent system driven by large language models for adaptive modification and optimization of the KC graph. Additionally, a bidirectional feedback mechanism is integrated into the algorithm, where AI agents leverage this mechanism to assess the value of edges within the KC graph and adjust the distribution of generation probabilities for different edges, thereby accelerating the efficiency of structure learning. We applied the proposed algorithm to 5 synthetic datasets and 4 real-world educational datasets, and experimental results validate its effectiveness in learning path recognition. By accurately identifying learners' learning paths, teachers are able to design more comprehensive learning plans, enabling learners to achieve their educational goals more effectively, thus promoting the sustainable development of education.

摘要

知识组件(KCs)是教育领域中的基本知识单元。KC图展示了各知识组件间的关联与依赖关系。精确的KC图能帮助教育者定位学习者在特定知识组件上表现不佳的根源,从而实现精准的教学干预。为此,我们开发了一种名为MAS-KCL的KC图结构学习算法,该算法采用基于大语言模型驱动的多智能体系统,对KC图进行自适应修改与优化。算法中同时嵌入了双向反馈机制,AI智能体通过该机制评估KC图中边的价值,并调整不同边的生成概率分布,从而加速结构学习的效率。我们将所提算法应用于5个合成数据集和4个真实教育数据集,实验结果验证了其在学习路径识别方面的有效性。通过准确识别学习者的学习路径,教师能够制定更完善的学习计划,使学习者更高效地达成教育目标,从而促进教育的可持续发展。


DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models

Abstract

arXiv:2505.14107v1 Announce Type: cross Abstract: The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties and derived from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3-mini, o1, and DeepSeek-R1, achieve only 45.82%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/DiagnosisArena.

摘要

具有复杂推理能力的突破性大型语言模型的出现,为解决各类科学挑战(包括复杂临床场景中的问题)带来了重要前景。为确保其在现实医疗环境中安全有效地部署,亟需对现有模型的诊断能力进行系统性基准测试。鉴于现有医学基准在评估高级诊断推理方面的局限性,我们提出了DiagnosisArena——一个全面且具有挑战性的基准测试,旨在严格评估专业级诊断能力。该基准包含1,113对分段患者病例及对应诊断,涵盖28个医学专科,源自10种顶级医学期刊发表的临床病例报告。通过由AI系统和人类专家参与的多轮筛选与评审的严谨构建流程,并实施彻底的数据泄露防范检查,最终完成基准开发。我们的研究表明,即使最先进的推理模型o3-mini、o1和DeepSeek-R1,其准确率也仅分别达到45.82%、31.09%和17.79%。这一发现揭示了当前大型语言模型在临床诊断推理挑战中存在的显著泛化瓶颈。通过DiagnosisArena,我们旨在推动AI诊断推理能力的进一步发展,为现实临床诊断挑战提供更有效的解决方案。我们公开该基准及评估工具以供进一步研究开发:https://github.com/SPIRAL-MED/DiagnosisArena。


Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging

Abstract

arXiv:2505.14136v1 Announce Type: cross Abstract: Mixture-of-experts (MoE) models are a promising approach to increasing model capacity without increasing inference cost, and are core components of many state-of-the-art language models. However, current MoE models typically use only a few experts due to prohibitive training and inference cost. We propose Test-Time Model Merging (TTMM), which scales the MoE paradigm to an order of magnitude more experts and uses model merging to avoid almost any test-time overhead. We show that TTMM is an approximation of test-time training (TTT), which fine-tunes an expert model for each prediction task, i.e., prompt. TTT has recently been shown to significantly improve language models, but is computationally expensive. We find that the performance of TTMM improves with more experts and approaches the performance of TTT. Moreover, we find that with a 1B parameter base model, TTMM is more than 100x faster than TTT at test time by amortizing the cost of TTT at train time. Thus, TTMM offers a promising cost-effective approach to scale test-time training.

摘要

专家混合(MoE)模型是一种在不增加推理成本的情况下提升模型容量的有效方法,已成为众多前沿语言模型的核心组件。然而,由于训练和推理成本过高,当前MoE模型通常仅使用少量专家。我们提出测试时模型融合(TTMM)方法,将MoE范式扩展至数量级更多的专家,并通过模型融合几乎消除所有测试时开销。研究表明,TTMM可视为测试时训练(TTT)的近似方法——后者需针对每个预测任务(即提示)对专家模型进行微调。尽管TTT近期被证明能显著提升语言模型性能,但其计算代价高昂。我们发现TTMM性能随专家数量增加而提升,并逐渐逼近TTT的表现。此外,当基础模型参数达10亿时,TTMM通过训练阶段分摊TTT成本,测试时速度比TTT快100倍以上。因此,TTMM为扩展测试时训练提供了一种具有成本效益的可行方案。
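The merging step can be pictured as collapsing several expert parameter sets into one set before the forward pass, so that inference costs the same as running a single model. The sketch below is an illustration under assumptions, not the paper's procedure: the experts are flat lists of floats, the per-expert `scores` (for instance, a similarity between the prompt and each expert's training data) are hypothetical inputs, and the softmax weighting with a `temperature` is one plausible choice of merging coefficients.

```python
import math


def merge_experts(expert_params, scores, temperature=1.0):
    """Merge expert parameter vectors into one vector by a softmax-weighted average.

    expert_params: list of equal-length lists of floats, one per expert.
    scores: one relevance score per expert (hypothetical, e.g. prompt similarity).
    Returns (merged_params, coefficients).
    """
    if len(expert_params) != len(scores) or not scores:
        raise ValueError("need exactly one score per expert")
    top = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    coeffs = [e / total for e in exps]
    size = len(expert_params[0])
    merged = [sum(c * p[i] for c, p in zip(coeffs, expert_params))
              for i in range(size)]
    return merged, coeffs


merged, coeffs = merge_experts([[1.0, 2.0], [3.0, 4.0]], scores=[0.0, 0.0])
```

With equal scores the result is the plain average of the experts; as one score dominates, the merged parameters approach that single expert.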


Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits

Abstract

arXiv:2505.14178v1 Announce Type: cross Abstract: Tokenization is the first - and often underappreciated - layer of computation in language models. While Chain-of-Thought (CoT) prompting enables transformer models to approximate recurrent computation by externalizing intermediate steps, we show that the success of such reasoning is fundamentally bounded by the structure of tokenized inputs. This work presents a theoretical and empirical investigation into how tokenization schemes, particularly subword-based methods like byte-pair encoding (BPE), impede symbolic computation by merging or obscuring atomic reasoning units. We introduce the notion of Token Awareness to formalize how poor token granularity disrupts logical alignment and prevents models from generalizing symbolic procedures. Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate that token structure dramatically affects reasoning performance, causing failure even with CoT, while atomically-aligned formats unlock strong generalization, allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g., o1) in structured reasoning. Our findings reveal that symbolic reasoning ability in LLMs is not purely architectural, but deeply conditioned on token-level representations.

摘要

词元化是语言模型中首个——且常被低估的——计算层。尽管思维链(CoT)提示通过外显中间步骤使Transformer模型能够近似循环计算,但我们证明此类推理的成功从根本上受限于词元化输入的结构。本研究从理论和实证角度探讨了词元化方案(尤其是基于子词的字节对编码等方法)如何通过合并或模糊原子推理单元来阻碍符号计算。我们提出"词元感知"概念,用以形式化低劣的词元粒度如何破坏逻辑对齐并阻碍模型泛化符号过程。通过对算术和符号任务的系统评估,我们证明词元结构会显著影响推理性能——即使采用CoT仍会导致失败,而原子对齐的格式却能释放强大的泛化能力,使小模型(如GPT-4o-mini)在结构化推理中超越大系统(如o1)。研究结果表明,大语言模型的符号推理能力并非纯粹取决于架构,而是深度依赖于词元层面的表征。
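The contrast between merged subword tokens and an atomically-aligned format can be illustrated with a toy example. The sketch below assumes a hypothetical three-entry vocabulary and a longest-match tokenizer as a simplified stand-in for BPE (real BPE applies learned merge rules); `atomize_digits` shows one possible way to force every digit into its own unit and is not taken from the paper.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match-first tokenizer; unknown pieces fall back to single characters."""
    tokens, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def atomize_digits(text):
    """Rewrite text so that every digit is separated by spaces."""
    spaced = "".join(f" {ch} " if ch.isdigit() else ch for ch in text)
    return " ".join(spaced.split())


vocab = {"123", "45", "12"}                      # hypothetical subword vocabulary
merged_tokens = greedy_tokenize("12345", vocab)  # digits fused into multi-digit chunks
atomic_text = atomize_digits("12345+678")        # one digit per unit
atomic_tokens = [t for t in greedy_tokenize(atomic_text, vocab) if t != " "]
```

In the first case the model sees chunks such as "123" and "45", whose boundaries do not line up with place values; in the second, each digit is its own token, which is the kind of alignment the abstract credits with better generalization.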


Abstract

arXiv:2505.14156v1 Announce Type: cross Abstract: Session search involves a series of interactive queries and actions to fulfill a user's complex information need. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. While some approaches focus on capturing structural information, they use a generalized representation for documents, neglecting word-level semantic modeling. In this paper, we propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches by leveraging the power of recent Large Language Models (LLMs). Concretely, we first introduce a set of symbolic grammar rules to convert the session graph into text. This allows integrating session history, interaction process, and task instruction seamlessly as inputs for the LLM. Moreover, given the natural discrepancy between LLMs pre-trained on textual corpora and the symbolic language we produce using our graph-to-text grammar, our objective is to enhance LLMs' ability to capture graph structures within a textual format. To achieve this, we introduce a set of self-supervised symbolic learning tasks, including link prediction, node content generation, and generative contrastive learning, to enable LLMs to capture the topological information from coarse-grained to fine-grained. Experiment results and comprehensive analysis on two benchmark datasets, AOL and Tiangong-ST, confirm the superiority of our approach. Our paradigm also offers a novel and effective methodology that bridges the gap between traditional search strategies and modern LLMs.

摘要

会话搜索涉及一系列交互式查询与操作,旨在满足用户复杂的信息需求。现有策略通常优先采用序列建模以实现深层语义理解,却忽视了交互过程中的图结构特征。部分方法虽关注结构信息捕获,但仅采用通用文档表示,未能实现词级语义建模。本文提出符号化图排序器(SGR),通过结合大型语言模型(LLMs)的优势,协同利用基于文本与基于图结构的处理方法。具体而言,我们首先引入一组符号化语法规则将会话图转化为文本,从而将会话历史、交互过程和任务指令无缝整合为LLM的输入。鉴于基于文本语料预训练的LLM与我们通过图-文本语法生成的符号语言存在固有差异,我们的目标是增强LLM在文本格式下捕捉图结构的能力。为此,我们设计了一套自监督符号学习任务,包括链接预测、节点内容生成和生成式对比学习,使LLM能够从粗粒度到细粒度逐步捕获拓扑信息。在AOL和Tiangong-ST两个基准数据集上的实验结果与综合分析验证了本方法的优越性。该研究范式还为弥合传统搜索策略与现代LLM之间的鸿沟提供了一种新颖有效的方法论。


Prior Prompt Engineering for Reinforcement Fine-Tuning

Abstract

arXiv:2505.14157v1 Announce Type: cross Abstract: This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt--the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning--remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies--reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization--into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.

摘要

本文研究了强化微调(RFT)背景下的先验提示工程(pPE),其中语言模型(LM)通过奖励信号被激励表现出最大化性能的行为。尽管现有RFT研究主要集中于算法、奖励塑造和数据整理,但先验提示的设计——即在训练时附加于查询前的指令,用于引发逐步推理等行为——仍未得到充分探索。我们探究不同pPE方法是否能引导LM在RFT后内化不同行为。受推理时提示工程(iPE)启发,我们将五种代表性iPE策略(推理、规划、基于代码的推理、知识回忆和空示例利用)转化为相应的pPE方法。我们使用Qwen2.5-7B模型对每种pPE方法进行实验,随后在领域内和领域外基准测试(如AIME2024、HumanEval+和GPQA-Diamond)上评估性能。结果表明,所有经过pPE训练的模型都优于其iPE提示的对应模型,其中空示例pPE方法实现了最大的平均性能提升,并在AIME2024和GPQA-Diamond上获得最高改进,超越了常用的推理方法。此外,通过采用行为分类框架,我们证明不同pPE策略会在最终模型中注入不同的行为风格。这些发现将pPE定位为RFT中强大但尚未被充分研究的重要维度。


FLASH-D: FlashAttention with Hidden Softmax Division

Abstract

arXiv:2505.14201v1 Announce Type: cross Abstract: The transformer's attention mechanism has revolutionized AI and machine learning, with its efficient computation being crucial to its performance. However, calculating attention involves matrix operations interspersed with softmax rescaling, which inherently slows down computation and requires processing the entire input sequence. Building on online softmax computation, FlashAttention integrates softmax calculation with matrix arithmetic, enabling tiled computation independent of sequence length. While optimized for GPUs, FlashAttention's simplicity makes it amenable to direct hardware acceleration. This work re-evaluates the core FlashAttention kernel, presenting FLASH-D, a mathematically equivalent yet simplified formulation that achieves: (a) hiding softmax division within other non-linear function evaluations; (b) inherently numerically stable computation of exponentials, eliminating the need for maximum value subtraction; and (c) a reduction in computational cost without introducing numerical approximations to the FlashAttention kernel. Importantly, the essential FlashAttention properties that facilitate efficient tiled implementation are fully preserved. Hardware implementation results at 28nm demonstrate that the proposed formulation achieves a 22.8% reduction in area and a 20.3% reduction in power, on average, compared to state-of-the-art parallel hardware architectures without any performance penalty.

摘要

Transformer的注意力机制彻底改变了人工智能和机器学习领域,其高效计算对性能至关重要。然而,传统注意力计算涉及矩阵运算与softmax重缩放交替进行,这本质上会降低计算速度并需要处理整个输入序列。基于在线softmax计算技术,FlashAttention将softmax计算与矩阵运算相融合,实现了不受序列长度限制的分块计算。虽然该算法针对GPU进行了优化,但其简洁性使其易于直接进行硬件加速。本研究重新评估了FlashAttention核心算法,提出数学等效但更简化的FLASH-D方案,其特点包括:(a) 将softmax除法隐藏于其他非线性函数计算中;(b) 实现指数计算的固有数值稳定性,无需进行最大值减法;(c) 在不引入数值近似的前提下降低FlashAttention内核的计算成本。关键的是,该方法完整保留了支持高效分块实现的FlashAttention核心特性。28纳米工艺的硬件实现结果表明,与最先进的并行硬件架构相比,该方案在保持性能不变的同时,平均实现了22.8%的面积缩减和20.3%的功耗降低。
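For context, the online softmax computation that FlashAttention builds on can be sketched for a single query as follows. This is the baseline formulation, with a running maximum and a final division, that the abstract says FLASH-D simplifies; it does not reproduce FLASH-D itself. The function names are illustrative and the scores are assumed to be finite.

```python
import math


def online_softmax_attention(scores, values):
    """Attention output for one query in a single streaming pass.

    scores[i] is the query-key dot product for position i (assumed finite);
    values[i] is the value vector for position i, given as a list of floats.
    """
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale earlier partial sums to the new maximum (this is 0.0 on the first step).
        scale = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        denom = denom * scale + w
        acc = [a * scale + w * x for a, x in zip(acc, v)]
        running_max = new_max
    # The softmax division is deferred to the very end.
    return [a / denom for a in acc]


def two_pass_attention(scores, values):
    """Reference: ordinary softmax over all scores, then a weighted sum of values."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]
```

Both functions give the same result up to rounding; the streaming version never needs all scores at once, which is what makes tiled computation possible. The maximum subtraction and the final division are the two steps that items (a) and (b) of the abstract refer to.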


Safety Subspaces are Not Distinct: A Fine-Tuning Case Study

Abstract

arXiv:2505.14185v1 Announce Type: cross Abstract: Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. This is typically achieved through instruction tuning and reinforcement learning from human feedback. However, this alignment is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable geometric directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this geometric perspective. We examine whether safety-relevant behavior is concentrated in specific subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in internal representations. Across both parameter and activation space, our findings are consistent: subspaces that amplify safe behaviors also amplify unsafe ones, and prompts with different safety implications activate overlapping representations. We find no evidence of a subspace that selectively governs safety. These results challenge the assumption that alignment is geometrically localized. Rather than residing in distinct directions, safety appears to emerge from entangled, high-impact components of the model's broader learning dynamics. This suggests that subspace-based defenses may face fundamental limitations and underscores the need for alternative strategies to preserve alignment under continued training. We corroborate these findings through multiple experiments on five open-source LLMs. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.

摘要

大语言模型(LLMs)依赖安全对齐机制来生成符合社会规范的响应,这一目标通常通过指令微调和基于人类反馈的强化学习实现。然而,这种对齐具有脆弱性:即使在良性或轻微污染数据上进行微调,也可能破坏安全性并重新引发有害行为。现有研究表明,对齐可能对应着权重空间中可识别的几何方向,这些方向构成的子空间理论上可被隔离或保存以抵御失准现象。本研究对该几何视角进行了全面实证分析,探究了安全相关行为是否集中于特定子空间、能否与通用学习相分离,以及危害性是否源于内部表征中的可区分模式。在参数空间和激活空间的实验中,我们得出一致结论:增强安全行为的子空间同样会放大不安全行为,且不同安全属性的提示会激活重叠的表征。未发现存在选择性控制安全性的子空间证据。这些结果挑战了"对齐具有几何局部性"的假设,表明安全性产生于模型整体学习动态中相互纠缠的高影响力组件,而非独立方向。这提示基于子空间的防御方法可能存在根本性局限,亟需开发持续训练中保持对齐的新策略。我们在五个开源LLMs上通过多组实验验证了上述发现。代码公开于:https://github.com/CERT-Lab/safety-subspaces。


Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks

Abstract

arXiv:2505.14212v1 Announce Type: cross Abstract: A question-answering (QA) system searches for suitable answers within a knowledge base. Current QA systems struggle with queries requiring complex reasoning or real-time knowledge integration. They are often supplemented with retrieval techniques over a data source, such as Retrieval-Augmented Generation (RAG). However, RAG continues to face challenges in handling complex reasoning and logical connections between multiple sources of information. A novel approach for enhancing Large Language Models (LLMs) in knowledge-intensive QA tasks is presented through the automated generation of context-based QA pairs. This methodology leverages LLMs to create fine-tuning data, reducing reliance on human labelling and improving model comprehension and reasoning capabilities. The proposed system includes an automated QA generator and a model fine-tuner, evaluated using perplexity, ROUGE, BLEU, and BERTScore. Comprehensive experiments demonstrate improvements in logical coherence and factual accuracy, with implications for developing adaptable Artificial Intelligence (AI) systems. Mistral-7b-v0.3 outperforms Llama-3-8b, with BERT F1, BLEU, and ROUGE scores of 0.858, 0.172, and 0.260 for the LLM-generated QA pairs, compared to scores of 0.836, 0.083, and 0.139 for the human-annotated QA pairs.

摘要

问答(QA)系统旨在知识库中检索合适答案。现有QA系统在处理需要复杂推理或实时知识整合的查询时存在困难,通常需辅以检索增强生成(RAG)等数据源检索技术。然而,RAG在应对复杂推理与多源信息间逻辑关联方面仍面临挑战。本文提出一种通过自动生成基于上下文的QA对来增强大语言模型(LLMs)在知识密集型QA任务中表现的新方法。该技术利用LLMs生成微调数据,降低对人类标注的依赖,同时提升模型理解与推理能力。所提出的系统包含自动化QA生成器与模型微调器,采用困惑度、ROUGE、BLEU和BERTScore进行评估。综合实验表明,该方法在逻辑连贯性与事实准确性方面均有提升,对开发适应性人工智能(AI)系统具有启示意义。在LLM生成的QA对中,Mistral-7b-v0.3以0.858的BERT F1值、0.172的BLEU值和0.260的ROUGE值优于Llama-3-8b,而人工标注QA对的对应得分分别为0.836、0.083和0.139。


"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs

Abstract

arXiv:2505.14226v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become increasingly powerful, with multilingual and multimodal capabilities improving by the day. These models are being evaluated through audits, alignment studies and red-teaming efforts to expose model vulnerabilities towards generating harmful, biased and unfair content. Existing red-teaming efforts have previously focused on the English language, using fixed template-based attacks; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially in the multimodal context. In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. We also introduce two new jailbreak strategies that show higher effectiveness than baseline strategies. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. Our novel prompts achieve a 99% Attack Success Rate for text generation and 78% for image generation, with Attack Relevance Rate of 100% for text generation and 95% for image generation when using the phonetically perturbed code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success. Our study motivates increasing the focus towards more generalizable safety alignment for multilingual multimodal models, especially in real-world settings wherein prompts can have misspelt words.

摘要

大型语言模型(LLMs)的能力日益强大,其多语言和多模态特性日臻完善。当前通过审计、对齐研究和红队测试等方法评估这些模型时,主要暴露了其生成有害、偏见及不公平内容的脆弱性。现有红队测试多聚焦于英语领域,采用固定模板攻击策略,导致模型仍易受多语言越狱策略影响,尤其在多模态情境下。本研究提出一种创新策略,通过代码混合和语音扰动实现文本与图像生成任务的双重越狱,并引入两种较基线策略更高效的新型越狱方法。我们提出的方法通过将语音拼写错误应用于代码混合提示中的敏感词,在保持可解释性的同时有效绕过LLMs的安全过滤器。实验表明,采用语音扰动的代码混合提示在文本生成中达到99%的攻击成功率(ASR),图像生成达78%;其攻击相关率(ARR)分别为100%和95%。可解释性实验揭示语音扰动通过影响词汇标记化实现越狱成功。本研究呼吁加强多语言多模态模型的泛化安全对齐研究,特别是在现实场景中存在拼写错误的提示情境下。


FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation

Abstract

arXiv:2505.14256v1 Announce Type: cross Abstract: In this paper, we present FuxiMT, a novel Chinese-centric multilingual machine translation model powered by a sparsified large language model (LLM). We adopt a two-stage strategy to train FuxiMT. We first pre-train the model on a massive Chinese corpus and then conduct multilingual fine-tuning on a large parallel dataset encompassing 65 languages. FuxiMT incorporates Mixture-of-Experts (MoEs) and employs a curriculum learning strategy for robust performance across various resource levels. Experimental results demonstrate that FuxiMT significantly outperforms strong baselines, including state-of-the-art LLMs and machine translation models, particularly under low-resource scenarios. Furthermore, FuxiMT exhibits remarkable zero-shot translation capabilities for unseen language pairs, indicating its potential to bridge communication gaps where parallel data are scarce or unavailable.

摘要

本文提出FuxiMT——一种基于稀疏化大语言模型(LLM)的新型以中文为核心的多语言机器翻译模型。我们采用两阶段策略训练FuxiMT:首先在大规模中文语料库上进行预训练,随后在包含65种语言的大型平行数据集上进行多语言微调。该模型融合专家混合(MoEs)机制,并采用课程学习策略以确保在不同资源水平下均能保持稳健性能。实验结果表明,FuxiMT显著优于包括最先进LLM和机器翻译模型在内的强基线系统,尤其在低资源场景下表现突出。此外,该模型对未见过的语言对展现出卓越的零样本翻译能力,表明其在平行数据稀缺或缺失场景下具有弥合沟通鸿沟的潜力。


ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models

Abstract

arXiv:2505.14238v1 Announce Type: cross Abstract: Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget. We formally analyze ABBA's expressive capacity and validate its advantages through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.

摘要

大型语言模型在广泛任务中展现出强大性能,但如何高效适应新领域仍是关键挑战。参数高效微调(PEFT)方法通过引入轻量级可训练模块并保持大部分预训练权重固定来解决此问题。主流方法LoRA采用低秩分解建模更新,但其表达能力本质上受秩的制约。HiRA等近期方法试图通过引入与冻结权重的Hadamard积来增强表达能力,但仍依赖于预训练模型的结构。我们提出ABBA,一种新型PEFT架构,将更新重新参数化为两个独立可学习低秩矩阵的Hadamard积。与现有工作不同,ABBA将更新与预训练权重完全解耦,使两个组件均可自由优化。这能在相同参数预算下实现显著更高的表达能力。我们对ABBA的表达能力进行了形式化分析,并通过矩阵重构实验验证其优势。实证表明,ABBA在算术和常识推理基准测试中取得最先进成果,在多个模型上始终以显著优势超越现有PEFT方法。代码已开源:https://github.com/CERT-Lab/abba。
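The expressivity claim in the ABBA abstract can be illustrated with a toy example. The sketch below is not taken from the ABBA code base: the matrices `B1`, `A1`, `B2`, `A2` and all helper names are illustrative assumptions. It only shows the linear-algebra fact that the Hadamard (element-wise) product of two rank-2 matrices can reach rank 4, whereas each low-rank factor alone is limited to rank 2.

```python
from fractions import Fraction


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hadamard(a, b):
    """Element-wise product of two matrices of the same shape."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rank(m):
    """Exact rank via Gaussian elimination over rationals."""
    rows = [[Fraction(x) for x in row] for row in m]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                f = rows[i][col] / rows[r][col]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# Hypothetical rank-2 factor pairs (4x2 times 2x4), chosen by hand for illustration.
B1 = [[1, 0], [1, 0], [0, 1], [0, 1]]
A1 = [[1, 1, 0, 0], [0, 0, 1, 1]]
B2 = [[1, 0], [0, 1], [1, 0], [0, 1]]
A2 = [[1, 0, 1, 0], [0, 1, 0, 1]]

low1 = matmul(B1, A1)          # rank-2 update, as a LoRA-style product would give
low2 = matmul(B2, A2)          # second rank-2 product
delta = hadamard(low1, low2)   # Hadamard-product update: here the 4x4 identity

print(rank(low1), rank(low2), rank(delta))
```

In general the Hadamard product of a rank-r1 and a rank-r2 matrix has rank at most r1 * r2; the example is constructed so that this bound is attained.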


Mechanistic Fine-tuning for In-context Learning

Abstract

arXiv:2505.14233v1 Announce Type: cross Abstract: In-context Learning (ICL) utilizes structured demonstration-query inputs to induce few-shot learning on Language Models (LMs), which are not originally pre-trained on ICL-style data. To bridge the gap between ICL and pre-training, some approaches fine-tune LMs on large ICL-style datasets by an end-to-end paradigm with massive computational costs. To reduce such costs, in this paper, we propose Attention Behavior Fine-Tuning (ABFT), utilizing the previous findings on the inner mechanism of ICL, building training objectives on the attention scores instead of the final outputs, to force the attention scores to focus on the correct label tokens presented in the context and mitigate attention scores from the wrong label tokens. Our experiments on 9 modern LMs and 8 datasets empirically find that ABFT outperforms in performance, robustness, unbiasedness, and efficiency, with only around 0.01% data cost compared to the previous methods. Moreover, our subsequent analysis finds that the end-to-end training objective contains the ABFT objective, suggesting the implicit bias of ICL-style data to the emergence of induction heads. Our work demonstrates the possibility of controlling specific module sequences within LMs to improve their behavior, opening up the future application of mechanistic interpretability.

摘要

情境学习(ICL)通过结构化示范-查询输入诱导语言模型(LM)进行小样本学习,而此类模型最初并未在ICL风格数据上进行预训练。为弥合ICL与预训练间的差距,现有方法采用端到端范式在大型ICL风格数据集上微调LM,但需付出巨大计算成本。为降低此类成本,本文提出注意力行为微调(ABFT),基于ICL内部机制的研究发现,在注意力分数而非最终输出上构建训练目标,迫使注意力分数聚焦于上下文中正确的标签标记,同时抑制对错误标签标记的关注。我们在9个现代LM和8个数据集上的实验表明,ABFT在性能、鲁棒性、无偏性和效率方面均表现优异,数据消耗量仅为先前方法的0.01%。进一步分析发现,端到端训练目标包含ABFT目标,暗示ICL风格数据对归纳头涌现存在隐性偏好。本研究证实了通过控制LM内部特定模块序列来改善其行为的可行性,为机制可解释性的未来应用开辟了新途径。
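To make the idea of a training objective defined on attention scores more concrete, here is a minimal sketch. The abstract does not give the exact ABFT loss, so the formula below (negative log of the share of label-token attention that falls on correct-label tokens) is an assumption, and the function name, arguments and numbers are hypothetical.

```python
import math


def attention_label_loss(attn, label_positions, labels, target, eps=1e-12):
    """Assumed objective: penalize attention mass placed on wrong-label tokens.

    attn            -- attention weights from the query position over context tokens
    label_positions -- indices of the demonstration label tokens in the context
    labels          -- the label shown at each of those positions
    target          -- the correct label for the current query
    """
    correct = sum(attn[p] for p, lab in zip(label_positions, labels) if lab == target)
    total = sum(attn[p] for p in label_positions)
    return -math.log((correct + eps) / (total + eps))


# Two demonstrations whose label tokens sit at positions 1 and 3 (made-up values).
positions, labels = [1, 3], ["pos", "neg"]
focused = [0.05, 0.60, 0.05, 0.10, 0.20]    # most label attention on the "pos" token
distracted = [0.05, 0.10, 0.05, 0.60, 0.20]  # most label attention on the "neg" token

print(attention_label_loss(focused, positions, labels, "pos"))
print(attention_label_loss(distracted, positions, labels, "pos"))
```

The loss is small when attention concentrates on the correct label token and grows as attention drifts to wrong labels, which matches the behavior the abstract describes at a high level.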


Think-J: Learning to Think for Generative LLM-as-a-Judge

Abstract

arXiv:2505.14268v1 Announce Type: cross Abstract: LLM-as-a-Judge refers to the automatic modeling of preferences for responses generated by Large Language Models (LLMs), which is of significant importance for both LLM evaluation and reward modeling. Although generative LLMs have made substantial progress in various tasks, their performance as LLM-Judge still falls short of expectations. In this work, we propose Think-J, which improves generative LLM-as-a-Judge by learning how to think. We first utilized a small amount of curated data to develop the model with initial judgment thinking capabilities. Subsequently, we optimize the judgment thinking traces based on reinforcement learning (RL). We propose two methods for judgment thinking optimization, based on offline and online RL, respectively. The offline RL requires training a critic model to construct positive and negative examples for learning. The online method defines rule-based reward as feedback for optimization. Experimental results showed that our approach can significantly enhance the evaluation capability of generative LLM-Judge, surpassing both generative and classifier-based LLM-Judge without requiring extra human annotations.

摘要

LLM-as-a-Judge(大语言模型作为评判者)是指对大语言模型(LLMs)生成响应的偏好进行自动建模,这对LLM评估和奖励建模具有重要意义。尽管生成式LLMs在各种任务中取得了显著进展,但其作为LLM-Judge的表现仍不尽如人意。本研究提出Think-J方法,通过学习思考过程来改进生成式LLM-as-a-Judge。我们首先利用少量精选数据开发具备初始判断思维能力的模型,随后基于强化学习(RL)优化判断思维轨迹。我们提出了两种判断思维优化方法,分别基于离线和在线RL。离线RL需训练批评模型以构建正负样本供学习;在线方法则通过定义基于规则的奖励作为优化反馈。实验结果表明,我们的方法能显著提升生成式LLM-Judge的评估能力,在不依赖额外人工标注的情况下,超越生成式和基于分类器的LLM-Judge。


YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering

Abstract

arXiv:2505.14279v1 Announce Type: cross Abstract: Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.

摘要

大语言模型(LLMs)推动了现代搜索引擎上的科学问答,但其评估鲁棒性仍未得到充分探索。我们提出YESciEval,一个开源框架,结合基于细粒度量表的评估与强化学习,以减轻LLM评估者的乐观偏差。我们发布了多学科科学问答数据集(包括对抗性变体)及多个LLM的评估分数。该方法独立于专有模型和人类反馈,支持可扩展、零成本的评估。通过推进可靠的"LLM即评委"模型,本研究支持AI对齐,并促进对科学探索和通用人工智能至关重要的鲁棒、透明评估。


Speculative Decoding Reimagined for Multimodal Large Language Models

Abstract

arXiv:2505.14260v1 Announce Type: cross Abstract: This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD uses a two-stage training strategy: In stage one, the draft model is trained on text-only instruction-tuning datasets to improve its language modeling ability. In stage two, MSD gradually introduces multimodal data to enhance the visual perception capability of the draft model. Experiments show that MSD boosts inference speed by up to 2.29× for LLaVA-1.5-7B and up to 2.46× for LLaVA-1.5-13B on multimodal benchmarks, demonstrating its effectiveness. Our code is available at https://github.com/Lyn-Lucy/MSD.

摘要

本文提出多模态推测解码(MSD)方法以加速多模态大语言模型(MLLMs)的推理。推测解码技术已被证明能在不损失准确性的前提下加速大语言模型(LLMs),但现有针对MLLMs的推测解码方法无法实现与LLMs同等的加速效果。为此,我们针对MLLMs的特性重新设计了推测解码框架。通过分析MLLMs的特征,我们提炼出MSD的两大核心设计原则:(1)文本与视觉标记存在本质差异,需在草拟阶段分别处理;(2)语言建模能力与视觉感知能力对草拟模型均至关重要。针对第一原则,MSD在草拟模型中解耦文本与视觉标记,使其根据各自特性独立处理;针对第二原则,MSD采用两阶段训练策略:第一阶段在纯文本指令调优数据集上训练草拟模型以提升语言建模能力,第二阶段逐步引入多模态数据以增强草拟模型的视觉感知能力。实验表明,MSD在多模态基准测试中为LLaVA-1.5-7B和LLaVA-1.5-13B分别带来最高2.29倍和2.46倍的推理加速,验证了其有效性。代码已开源:https://github.com/Lyn-Lucy/MSD。
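For readers unfamiliar with speculative decoding, the following sketch shows the generic greedy draft-and-verify loop on which such methods build. It is text-only and does not reproduce MSD's handling of visual tokens or its training strategy. The toy `target_next` and `draft_next` functions stand in for real models and are purely hypothetical.

```python
def greedy_decode(target_next, prompt, n_new):
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0  # in a real system each round is one batched target forward pass
    while len(seq) < goal:
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        rounds += 1
        accepted = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)          # draft token confirmed
            else:
                accepted.append(expected)     # replace first mismatch, drop the rest
                break
        else:
            accepted.append(target_next(seq + accepted))  # bonus token after full accept
        seq.extend(accepted)
    return seq[:goal], rounds


def target_next(seq):
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy draft model: agrees with the target except after the token 2."""
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10


baseline = greedy_decode(target_next, [0], 12)
fast, rounds = speculative_decode(target_next, draft_next, [0], 12, k=4)
print(fast == baseline, rounds)
```

Because every accepted token is checked against the target model's own greedy choice, the output matches plain greedy decoding; the speedup comes from needing fewer verification rounds than generated tokens.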


Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion

Abstract

arXiv:2505.14316v1 Announce Type: cross Abstract: Although large language models (LLMs) have achieved remarkable advancements, their security remains a pressing concern. One major threat is jailbreak attacks, where adversarial prompts bypass model safeguards to generate harmful or objectionable content. Researchers study jailbreak attacks to understand security and robustness of LLMs. However, existing jailbreak attack methods face two main challenges: (1) an excessive number of iterative queries, and (2) poor generalization across models. In addition, recent jailbreak evaluation datasets focus primarily on question-answering scenarios, lacking attention to text generation tasks that require accurate regeneration of toxic content. To tackle these challenges, we propose two contributions: (1) ICE, a novel black-box jailbreak method that employs Intent Concealment and divErsion to effectively circumvent security constraints. ICE achieves high attack success rates (ASR) with a single query, significantly improving efficiency and transferability across different models. (2) BiSceneEval, a comprehensive dataset designed for assessing LLM robustness in question-answering and text-generation tasks. Experimental results demonstrate that ICE outperforms existing jailbreak techniques, revealing critical vulnerabilities in current defense mechanisms. Our findings underscore the necessity of a hybrid security strategy that integrates predefined security mechanisms with real-time semantic decomposition to enhance the security of LLMs.

摘要

尽管大语言模型(LLMs)已取得显著进展,但其安全性仍是紧迫问题。越狱攻击作为主要威胁之一,通过对抗性提示绕过模型防护机制以生成有害或不当内容。研究者通过分析越狱攻击来理解LLMs的安全性与鲁棒性。然而现有越狱攻击方法面临两大挑战:(1)迭代查询次数过多;(2)跨模型泛化能力差。此外,当前越狱评估数据集主要关注问答场景,忽视了需要精确再生有毒内容的文本生成任务。针对这些问题,我们提出两项贡献:(1)ICE方法——采用意图隐藏与诱导转移的新型黑盒越狱技术,能有效规避安全约束。该方法单次查询即可实现高攻击成功率(ASR),显著提升跨模型效率与迁移性;(2)BiSceneEval数据集——专为评估LLMs在问答与文本生成任务中的鲁棒性而设计。实验表明ICE优于现有越狱技术,揭示了当前防御机制的关键漏洞。本研究结果强调需要整合预定义安全机制与实时语义解析的混合安全策略,以增强LLMs的安全性。


Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations

Abstract

arXiv:2505.14469v1 Announce Type: cross Abstract: Recent advancements in LLMs have raised significant safety concerns, particularly when dealing with code-mixed inputs and outputs. Our study systematically investigates the increased susceptibility of LLMs to produce unsafe outputs from code-mixed prompts compared to monolingual English prompts. Utilizing explainability methods, we dissect the internal attribution shifts causing the model's harmful behaviors. In addition, we explore cultural dimensions by distinguishing between universally unsafe and culturally specific unsafe queries. This paper presents novel experimental insights, clarifying the mechanisms driving this phenomenon.

摘要

大型语言模型(LLM)的最新进展引发了重大安全隐患,尤其在处理语码混合(code-mixed)的输入输出时。本研究系统性地探讨了LLM相较于单语英语提示,在语码混合提示下更易产生不安全输出的现象。通过可解释性方法,我们剖析了导致模型有害行为的内部归因转变机制。此外,我们通过区分普遍不安全与文化特异性不安全查询,探索了文化维度的影响。本文提出了新颖的实验发现,阐明了驱动该现象的内在机制。


MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

Abstract

arXiv:2505.14395v1 Announce Type: cross Abstract: Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy of successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks (r > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.

摘要

评估大型语言模型(LLM)的文本生成能力具有挑战性,尤其对于低资源语言而言,直接评估方法十分匮乏。我们提出MUG-Eval这一新颖框架,通过将现有基准测试转化为对话任务并测量模型在这些任务上的准确率,来评估LLM的多语言生成能力。这些对话任务专门设计为要求目标语言的有效沟通能力,而后我们直接以任务成功率作为对话生成效果的代理指标。该方法具有两大关键优势:其一不依赖语言特定的NLP工具或标注数据集(这对大多数语言而言都较为稀缺),其二无需借助LLM作为评判者(其评估质量在少数高资源语言之外会显著下降)。我们在涵盖高、中、低资源等级的30种语言上评估了8个LLM,发现MUG-Eval与现有基准测试保持强相关性(r > 0.75),同时能实现跨语言与跨模型的标准化比较。该框架为多语言生成评估提供了稳健且资源高效的解决方案,可扩展应用于数千种语言。
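The reported r > 0.75 is presumably a Pearson correlation coefficient, although the abstract does not say which coefficient is used. As a reminder of how such a figure is obtained, the sketch below computes Pearson's r between two score lists. The function name and the score values are invented for illustration and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up per-model scores: proxy task success rate vs. an established benchmark.
proxy_scores = [0.42, 0.55, 0.61, 0.70, 0.83]
benchmark_scores = [0.38, 0.50, 0.66, 0.68, 0.90]

print(round(pearson_r(proxy_scores, benchmark_scores), 3))
```

A value near 1 means models are ranked and spaced similarly by both measures, which is what makes a proxy metric usable in place of a direct assessment.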


Choosing a Model, Shaping a Future: Comparing LLM Perspectives on Sustainability and its Relationship with AI

Abstract

arXiv:2505.14435v1 Announce Type: cross Abstract: As organizations increasingly rely on AI systems for decision support in sustainability contexts, it becomes critical to understand the inherent biases and perspectives embedded in Large Language Models (LLMs). This study systematically investigates how five state-of-the-art LLMs (Claude, DeepSeek, GPT, LLaMA, and Mistral) conceptualize sustainability and its relationship with AI. We administered validated, psychometric sustainability-related questionnaires, each 100 times per model, to capture response patterns and variability. Our findings revealed significant inter-model differences: for example, GPT exhibited skepticism about the compatibility of AI and sustainability, whereas LLaMA demonstrated extreme techno-optimism with perfect scores for several Sustainable Development Goals (SDGs). Models also diverged in attributing institutional responsibility for AI and sustainability integration, a result that holds implications for technology governance approaches. Our results demonstrate that model selection could substantially influence organizational sustainability strategies, highlighting the need for awareness of model-specific biases when deploying LLMs for sustainability-related decision-making.

摘要

随着各组织日益依赖人工智能系统在可持续发展领域提供决策支持,理解大语言模型(LLMs)中固有的偏见与观点变得至关重要。本研究系统考察了五种前沿大语言模型——Claude、DeepSeek、GPT、LLaMA和Mistral——如何概念化可持续发展及其与人工智能的关系。我们向每个模型各100次施测经过验证的、心理测量学相关的可持续发展问卷,以捕捉响应模式与变异性。研究发现存在显著的模型间差异:例如GPT表现出对人工智能与可持续发展兼容性的怀疑态度,而LLaMA则显示出极端的技术乐观主义,在多项可持续发展目标(SDGs)上获得满分。各模型在归因人工智能与可持续发展融合的制度责任方面也存在分歧,这一结果对技术治理方法具有启示意义。研究表明模型选择可能显著影响组织的可持续发展战略,强调在部署大语言模型进行可持续发展相关决策时,需要充分意识到模型特有的偏见。


Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation

Abstract

arXiv:2505.14398v1 Announce Type: cross Abstract: While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks and apply it in future contexts. To address this limitation, we propose a novel framework, log-augmented generation (LAG), that directly reuses prior computation and reasoning from past logs at test time to enhance the model's ability to learn from previous tasks and perform better on new, unseen challenges, all while keeping the system efficient and scalable. Specifically, our system represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for only a selected subset of tokens. When a new task arises, LAG retrieves the KV values from relevant logs to augment generation. Our approach differs from reflection-based memory mechanisms by directly reusing prior reasoning and computations without requiring additional steps for knowledge extraction or distillation. Our method also goes beyond existing KV caching techniques, which primarily target efficiency gains rather than improving accuracy. Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems that do not utilize logs, as well as existing solutions based on reflection and KV cache techniques.

摘要

虽然人类能够自然地通过过往经验进行学习与适应,但大语言模型(LLMs)及其代理系统难以保留先前任务的推理过程并将其应用于未来场景。为突破这一局限,我们提出了一种创新框架——日志增强生成(LAG),该框架在测试阶段直接复用历史日志中的计算与推理结果,从而增强模型从既往任务中学习的能力,并在处理全新挑战时表现更优,同时保持系统的高效性与可扩展性。具体而言,我们的系统采用键值(KV)缓存来表征任务日志,既完整编码历史任务的推理上下文,又仅针对选定标记子集存储KV缓存。当新任务出现时,LAG会从相关日志中检索KV值以增强生成效果。本方法与基于反思的记忆机制不同,它无需经过知识提取或蒸馏的额外步骤即可直接复用先前的推理与计算结果。该方法也超越了现有KV缓存技术——后者主要追求效率提升而非准确率改进。在知识与推理密集型数据集上的实验表明,我们的方法显著优于未使用日志的标准代理系统,以及基于反思和KV缓存技术的现有解决方案。


Creative Preference Optimization

Abstract

arXiv:2505.14442v1 Announce Type: cross Abstract: While Large Language Models (LLMs) have demonstrated impressive performance across natural language generation tasks, their ability to generate truly creative content, characterized by novelty, diversity, surprise, and quality, remains limited. Existing methods for enhancing LLM creativity often focus narrowly on diversity or specific tasks, failing to address creativity's multifaceted nature in a generalizable way. In this work, we propose Creative Preference Optimization (CrPO), a novel alignment method that injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion. We train and evaluate creativity-augmented versions of several models using CrPO and MuCE, a new large-scale human preference dataset spanning over 200,000 human-generated responses and ratings from more than 30 psychological creativity assessments. Our models outperform strong baselines, including GPT-4o, on both automated and human evaluations, producing more novel, diverse, and surprising generations while maintaining high output quality. Additional evaluations on NoveltyBench further confirm the generalizability of our approach. Together, our results demonstrate that directly optimizing for creativity within preference frameworks is a promising direction for advancing the creative capabilities of LLMs without compromising output quality.

摘要

虽然大型语言模型(LLMs)在自然语言生成任务中展现出卓越性能,但其生成真正创造性内容——以新颖性、多样性、惊喜度和质量为特征——的能力仍然有限。现有增强LLM创造力的方法通常狭隘地关注多样性或特定任务,未能以可泛化的方式应对创造力的多维度特性。本研究提出创造性偏好优化(CrPO),这是一种新颖的对齐方法,以模块化方式将多维度创造力信号注入偏好优化目标。我们使用CrPO和MuCE(一个涵盖超过200,000条人工生成响应及30多项心理学创造力评估评分的大规模人类偏好数据集)训练并评估了多个创造力增强模型。我们的模型在自动化和人工评估中均优于包括GPT-4o在内的强基线,能在保持高质量输出的同时生成更具新颖性、多样性和惊喜度的内容。在NoveltyBench上的进一步评估验证了该方法的泛化性。这些结果表明,在偏好框架内直接优化创造力是提升LLM创造性能力且不牺牲输出质量的有效方向。


ModRWKV: Transformer Multimodality in Linear Time

Abstract

arXiv:2505.14505v1 Announce Type: cross Abstract: Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV, a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone, which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly accelerates multimodal training. Comparative experiments with different pretrained checkpoints further demonstrate that such initialization plays a crucial role in enhancing the model's ability to understand multimodal signals. Supported by extensive experiments, we conclude that modern RNN architectures present a viable alternative to Transformers in the domain of multimodal large language models (MLLMs). Furthermore, we identify the optimal configuration of the ModRWKV architecture through systematic exploration.

摘要

当前,大多数多模态研究都基于具有二次复杂度Transformer架构的大语言模型(LLMs)。虽然RNN等线性模型推理成本较低,但其应用主要局限于纯文本模态。本研究探索了现代RNN架构在多模态语境中的能力。我们提出ModRWKV——一个基于RWKV7架构作为LLM主干的可解耦多模态框架,通过动态可适配的异构模态编码器实现多源信息融合。我们为ModRWKV设计了极轻量级的多模态模块,并通过大量实验确定了性能与计算效率达到最佳平衡的配置方案。ModRWKV利用RWKV7 LLM的预训练权重进行初始化,这显著加速了多模态训练过程。不同预训练检查点的对比实验进一步证明,此类初始化对增强模型理解多模态信号的能力具有关键作用。基于大量实验支撑,我们得出结论:现代RNN架构在多模态大语言模型(MLLMs)领域可作为Transformer的有效替代方案。此外,通过系统性探索,我们确定了ModRWKV架构的最佳配置方案。


CtrlDiff: Boosting Large Diffusion Language Models with Dynamic Block Prediction and Controllable Generation

Abstract

arXiv:2505.14455v1 Announce Type: cross Abstract: Although autoregressive models have dominated language modeling in recent years, there has been a growing interest in exploring alternative paradigms to the conventional next-token prediction framework. Diffusion-based language models have emerged as a compelling alternative due to their powerful parallel generation capabilities and inherent editability. However, these models are often constrained by fixed-length generation. A promising direction is to combine the strengths of both paradigms, segmenting sequences into blocks, modeling autoregressive dependencies across blocks while leveraging discrete diffusion to estimate the conditional distribution within each block given the preceding context. Nevertheless, their practical application is often hindered by two key limitations: rigid fixed-length outputs and a lack of flexible control mechanisms. In this work, we address the critical limitations of fixed granularity and weak controllability in current large diffusion language models. We propose CtrlDiff, a dynamic and controllable semi-autoregressive framework that adaptively determines the size of each generation block based on local semantics using reinforcement learning. Furthermore, we introduce a classifier-guided control mechanism tailored to discrete diffusion, which significantly reduces computational overhead while facilitating efficient post-hoc conditioning without retraining. Extensive experiments demonstrate that CtrlDiff sets a new standard among hybrid diffusion models, narrows the performance gap to state-of-the-art autoregressive approaches, and enables effective conditional text generation across diverse tasks.

摘要

尽管自回归模型近年来主导了语言建模领域,但学界对突破传统下一词预测框架的替代范式兴趣日增。基于扩散的语言模型因其强大的并行生成能力和内在可编辑性,已成为极具吸引力的替代方案。然而,这类模型常受限于固定长度生成的约束。一个颇具前景的研究方向是融合两种范式的优势:将序列分割为文本块,在块间建立自回归依赖关系,同时利用离散扩散模型在给定上文语境条件下估计块内条件概率分布。但此类模型的实际应用长期受两大关键限制制约:僵化的固定长度输出和灵活控制机制的缺失。本研究针对当前大规模扩散语言模型中固定粒度和弱可控性这两大核心缺陷,提出了CtrlDiff,即一种基于强化学习、能根据局部语义自适应确定生成块大小的动态可控半自回归框架。我们进一步设计了专用于离散扩散的分类器引导控制机制,在显著降低计算开销的同时,实现了无需重新训练的高效事后条件控制。大量实验表明,CtrlDiff为混合扩散模型树立了新标杆,缩小了与最先进自回归方法的性能差距,并在多样化任务中实现了高效的条件文本生成。


Exploring Graph Representations of Logical Forms for Language Modeling

Abstract

arXiv:2505.14523v1 Announce Type: cross Abstract: We make the case for language models over logical forms (LFLMs), arguing that such models are more data-efficient than their textual counterparts. To that end, we introduce the Graph-based Formal-Logical Distributional Semantics (GFoLDS) prototype, a pretrained LM over graph representations of logical forms, as a proof-of-concept of LFLMs. Using GFoLDS, we present strong experimental evidence that LFLMs can leverage the built-in, basic linguistic knowledge inherent in such models to immediately begin learning more complex patterns. On downstream tasks, we show that GFoLDS vastly outperforms textual, transformer LMs pretrained on similar amounts of data, indicating that LFLMs can learn with substantially less data than models over plain text. Furthermore, we show that the performance of this model is likely to scale with additional parameters and pretraining data, suggesting the viability of LFLMs in real-world applications.

摘要

我们提出基于逻辑形式的语言模型(LFLMs),论证其相较于纯文本模型具有更高的数据效率。为此,我们开发了基于图的逻辑分布语义原型(GFoLDS)——一种针对逻辑形式图表示进行预训练的语言模型,作为LFLMs的概念验证。通过GFoLDS实验,我们获得重要证据表明:LFLMs能够利用模型内置的基础语言学知识,快速掌握更复杂的模式。在下游任务中,GFoLDS显著优于基于相似数据量预训练的文本Transformer模型,证明LFLMs所需训练数据量远少于纯文本模型。此外,我们发现该模型性能可能随参数规模和预训练数据量提升而增强,这预示着LFLMs在实际应用中的可行性。


Enhanced Multimodal Aspect-Based Sentiment Analysis by LLM-Generated Rationales

Abstract

arXiv:2505.14499v1 Announce Type: cross Abstract: There has been growing interest in Multimodal Aspect-Based Sentiment Analysis (MABSA) in recent years. Existing methods predominantly rely on pre-trained small language models (SLMs) to collect information related to aspects and sentiments from both image and text, with an aim to align these two modalities. However, SLMs possess limited capacity and knowledge, often resulting in inaccurate identification of meaning, aspects, sentiments, and their interconnections in textual and visual data. On the other hand, large language models (LLMs) have shown exceptional capabilities in various tasks by effectively exploring fine-grained information in multimodal data. However, some studies indicate that LLMs still fall short compared to fine-tuned small models in the field of ABSA. Based on these findings, we propose a novel framework, termed LRSA, which combines the decision-making capabilities of SLMs with additional information provided by LLMs for MABSA. Specifically, we inject explanations generated by LLMs as rationales into SLMs and employ a dual cross-attention mechanism for enhancing feature interaction and fusion, thereby augmenting the SLMs' ability to identify aspects and sentiments. We evaluated our method using two baseline models, and numerous experiments highlight the superiority of our approach on three widely used benchmarks, indicating its generalizability and applicability to most pre-trained models for MABSA.

摘要

近年来,多模态方面级情感分析(MABSA)领域的研究兴趣日益增长。现有方法主要依赖预训练的小型语言模型(SLM)从图像和文本中收集与方面和情感相关的信息,旨在实现两种模态的对齐。然而,小型语言模型的能力和知识有限,往往导致对文本和视觉数据中含义、方面、情感及其相互关联的识别不准确。另一方面,大型语言模型(LLM)通过有效探索多模态数据中的细粒度信息,在各种任务中展现出卓越能力。但研究表明,在方面级情感分析领域,LLM仍逊色于经过微调的小型模型。基于这些发现,我们提出了一种名为LRSA的新型框架,该框架将SLM的决策能力与LLM提供的额外信息相结合,用于多模态方面级情感分析。具体而言,我们将LLM生成的解释作为推理依据注入SLM,并采用双重交叉注意力机制来增强特征交互与融合,从而提升SLM识别方面和情感的能力。通过使用两个基线模型进行评估,大量实验证明我们的方法在三个广泛使用的基准数据集上具有优越性,表明其对大多数预训练模型在多模态方面级情感分析任务中具有普适性和适用性。


ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

Abstract

arXiv:2505.14468v1 Announce Type: cross Abstract: Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless systems can effectively serve general LLMs but fail with Low-Rank Adaptation (LoRA) inference due to three key limitations: 1) massive parameter redundancy among functions, where 99% of weights are unnecessarily duplicated, 2) costly artifact loading latency beyond LLM loading, and 3) magnified resource contention when serving multiple LoRA LLMs. These inefficiencies lead to massive GPU wastage, increased Time-To-First-Token (TTFT), and high monetary costs. We propose ServerlessLoRA, a novel serverless inference system designed for faster and cheaper LoRA LLM serving. ServerlessLoRA enables secure backbone LLM sharing across isolated LoRA functions to reduce redundancy. We design a pre-loading method that pre-loads comprehensive LoRA artifacts to minimize cold-start latency. Furthermore, ServerlessLoRA employs contention-aware batching and offloading to mitigate GPU resource conflicts during bursty workloads. Experiments on industrial workloads demonstrate that ServerlessLoRA reduces TTFT by up to 86% and cuts monetary costs by up to 89% compared to state-of-the-art LLM inference solutions.

摘要

无服务器计算因其按使用付费的定价模式、细粒度GPU资源利用和快速扩展能力,在大型语言模型(LLM)推理服务中迅速发展。然而,我们的分析表明,当前无服务器架构能有效服务通用LLM,却难以胜任低秩自适应(LoRA)推理,主要存在三个关键缺陷:1)函数间存在大规模参数冗余,99%的权重被不必要地重复存储;2)除LLM加载外还需承担高昂的模型构件加载延迟;3)在服务多个LoRA增强的LLM时资源竞争问题加剧。这些低效性导致GPU资源大量浪费、首令牌生成时间(TTFT)延长及服务成本攀升。我们提出ServerlessLoRA——一种专为高效低成本LoRA推理服务设计的新型无服务器系统。该系统通过跨隔离LoRA函数的安全骨干LLM共享机制消除冗余,采用预加载策略全面载入LoRA构件以降低冷启动延迟,并实施基于竞争感知的批处理与卸载技术来缓解突发负载下的GPU资源冲突。工业级工作负载实验表明,相较最先进的LLM推理方案,ServerlessLoRA能将TTFT缩短86%,服务成本降低89%。
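The 99% redundancy figure is plausible from simple parameter counting. The sketch below works through the arithmetic for a single square weight matrix; the hidden size 4096 and LoRA rank 16 are assumed values for illustration, not numbers from the paper.

```python
def lora_share(d_model, rank):
    """Parameter counts for one d_model x d_model weight and its LoRA adapter.

    Returns (backbone_params, adapter_params, backbone_fraction), where the
    adapter consists of two factors of shape d_model x rank and rank x d_model.
    """
    backbone = d_model * d_model
    adapter = 2 * d_model * rank
    return backbone, adapter, backbone / (backbone + adapter)


# Assumed sizes: hidden dimension 4096, LoRA rank 16.
backbone, adapter, fraction = lora_share(4096, 16)
print(backbone, adapter, round(fraction, 4))
# → 16777216 131072 0.9922
```

Under these assumptions more than 99% of the weights behind each LoRA function belong to the shared backbone, so loading a private copy per function duplicates almost everything.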


Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models

Abstract

arXiv:2505.14436v1 Announce Type: cross Abstract: Large Language Models (LLMs) offer a transparent brain with accessible parameters that encode extensive knowledge, which can be analyzed, located and transferred. Consequently, a key research challenge is to transcend traditional knowledge transfer paradigms rooted in symbolic language and achieve genuine Parametric Knowledge Transfer (PKT). Significantly, exploring effective methods for transferring knowledge across LLMs of different scales through parameters presents an intriguing and valuable research direction. In this paper, we first demonstrate that Alignment in parametric space is the fundamental prerequisite to achieve successful cross-scale PKT. We redefine the previously explored knowledge transfer as Post-Align PKT (PostPKT), which utilizes extracted parameters for LoRA initialization and requires subsequent fine-tuning for alignment. Hence, to reduce the cost of further fine-tuning, we introduce a novel Pre-Align PKT (PrePKT) paradigm and propose a solution called LaTen (Locate-Then-Align) that aligns the parametric spaces of LLMs across scales using only several training steps, without subsequent training. Comprehensive experiments on four benchmarks demonstrate that both PostPKT and PrePKT face challenges in achieving consistently stable transfer. Through in-depth analysis, we identify Neural Incompatibility as the ethological and parametric structural differences between LLMs of varying scales, presenting fundamental challenges to achieving effective PKT. These findings provide fresh insights into the parametric architectures of LLMs and highlight promising directions for future research on efficient PKT. Our code is available at https://github.com/Trae1ounG/Neural_Incompatibility.

摘要

大语言模型(LLMs)提供了一个透明的"大脑",其可访问的参数编码了海量知识,这些知识可被分析、定位与迁移。因此,如何超越植根于符号语言的传统知识迁移范式,实现真正的参数化知识迁移(PKT)成为核心研究挑战。值得注意的是,探索通过参数在不同规模LLMs间迁移知识的有效方法,是一个极具价值的研究方向。本文首先论证了参数空间的对齐是实现跨尺度PKT的基本前提。我们将既有研究中的知识迁移重新定义为后对齐PKT(PostPKT),其利用提取参数进行LoRA初始化并需后续微调以实现对齐。为降低进一步微调的成本,我们提出新型前对齐PKT(PrePKT)范式,并设计LaTen(Locate-Then-Align,先定位后对齐)解决方案,仅需数步训练即可实现跨尺度LLMs参数空间对齐而无需后续训练。在四个基准测试上的综合实验表明,PostPKT与PrePKT均难以实现持续稳定的迁移。通过深度分析,我们发现神经不相容性(即不同规模LLMs在行为学与参数结构上的差异)是阻碍有效PKT的根本挑战。这些发现为LLMs的参数架构研究提供了新视角,并为高效PKT的未来研究指明了方向。代码已开源:https://github.com/Trae1ounG/Neural_Incompatibility。


Latent Flow Transformer

Abstract

arXiv:2505.14513v1 Announce Type: cross Abstract: Transformers, the standard implementation for large language models (LLMs), typically consist of tens to hundreds of discrete layers. While more layers can lead to better performance, this approach has been challenged as far from efficient, especially given the superiority of continuous layers demonstrated by diffusion and flow-based models for image generation. We propose the Latent Flow Transformer (LFT), which replaces a block of layers with a single learned transport operator trained via flow matching, offering significant compression while maintaining compatibility with the original architecture. Additionally, we address the limitations of existing flow-based methods in preserving coupling by introducing the Flow Walking (FW) algorithm. On the Pythia-410M model, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers (KL Divergence of LM logits at 0.407 vs. 0.529), demonstrating the feasibility of this design. When trained with FW, LFT further distills 12 layers into one while reducing the KL to 0.736, surpassing that from skipping 3 layers (0.932), significantly narrowing the gap between autoregressive and flow-based generation paradigms.

摘要

作为大语言模型(LLM)的标准实现,Transformer通常由数十至数百个离散层组成。尽管增加层数可提升性能,但该方法效率低下的问题日益凸显,尤其是基于扩散和流模型的连续层结构在图像生成领域已展现出显著优势。我们提出潜在流Transformer(LFT),通过流匹配训练将层块替换为单一可学习的传输算子,在保持与原架构兼容性的同时实现显著压缩。此外,针对现有流方法在保持耦合性方面的局限,我们提出流游走(FW)算法。在Pythia-410M模型上的实验表明:采用流匹配训练的LFT将24层中的6层压缩后,其性能优于直接跳过2层的方案(语言模型对数KL散度0.407 vs. 0.529);当采用FW训练时,LFT进一步将12层蒸馏为1层并将KL散度降至0.736,显著优于跳过3层的方案(0.932),从而大幅缩小了自回归与基于流的生成范式之间的差距。
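
The comparison metric quoted above, KL divergence of LM logits, can be illustrated with a minimal sketch. It assumes the divergence is taken from the original model's next-token distribution to the compressed model's distribution; the abstract does not state the direction, and the logit values below are invented toy numbers, not results from the paper.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between the softmax distributions of two logit vectors.

    Assumption: P is the original model, Q is the compressed model.
    """
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy logits over a three-token vocabulary (hypothetical values).
original_logits = [2.0, 1.0, 0.1]
compressed_logits = [1.8, 1.1, 0.3]

print(kl_divergence(original_logits, original_logits))    # identical models
print(kl_divergence(original_logits, compressed_logits))  # small positive gap
```

A lower value means the compressed model's predictions stay closer to the original's, which is how the 0.407 vs. 0.529 comparison in the abstract should be read.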


Can Large Language Models Really Recognize Your Name?

Abstract

arXiv:2505.14549v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly being used to protect sensitive user data. However, current LLM-based privacy solutions assume that these models can reliably detect personally identifiable information (PII), particularly named entities. In this paper, we challenge that assumption by revealing systematic failures in LLM-based privacy tasks. Specifically, we show that modern LLMs regularly overlook human names even in short text snippets due to ambiguous contexts, which cause the names to be misinterpreted or mishandled. We propose AMBENCH, a benchmark dataset of seemingly ambiguous human names, leveraging the name regularity bias phenomenon, embedded within concise text snippets along with benign prompt injections. Our experiments on modern LLMs tasked to detect PII as well as specialized tools show that recall of ambiguous names drops by 20--40% compared to more recognizable names. Furthermore, ambiguous human names are four times more likely to be ignored in supposedly privacy-preserving summaries generated by LLMs when benign prompt injections are present. These findings highlight the underexplored risks of relying solely on LLMs to safeguard user privacy and underscore the need for a more systematic investigation into their privacy failure modes.

摘要

大型语言模型(LLMs)正日益被用于保护敏感用户数据。然而,当前基于LLM的隐私解决方案假设这些模型能够可靠地检测个人身份信息(PII),尤其是命名实体。本文通过揭示基于LLM的隐私任务中的系统性故障,对这一假设提出了挑战。具体而言,我们证明现代LLMs在短文本片段中经常忽略人名,这是由于模糊语境导致这些名字被误解或处理不当。我们提出了AMBENCH——一个基于名称规律性偏差现象构建的、包含看似模糊人名的基准数据集,这些名字被嵌入简洁文本片段并伴有良性提示注入。我们在现代LLMs及专用工具上进行的实验表明,与更易识别的名字相比,模糊人名的召回率下降了20-40%。此外,当存在良性提示注入时,在LLMs生成的所谓隐私保护摘要中,模糊人名被忽略的可能性是其他名字的四倍。这些发现凸显了仅依赖LLMs保护用户隐私的未充分探索的风险,并强调需要对其隐私失效模式进行更系统性的研究。


KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Abstract

arXiv:2505.14552v1 Announce Type: cross Abstract: Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.

摘要

大语言模型(LLM)的最新进展凸显了对更全面评估方法的需求,以准确衡量其推理能力。现有基准测试通常局限于特定领域,因而无法充分捕捉LLM的通用推理潜力。为解决这一局限,我们受KOR-Bench和Gymnasium启发,开发了知识正交推理训练场(KORGym)这一动态评估平台。KORGym提供超过五十种文本或视觉形式的游戏,支持基于强化学习场景的交互式多轮评估。通过该平台,我们对19个LLM和8个VLM进行了广泛实验,揭示了模型家族内部一致的推理模式,并证明了闭源模型的优越性能。进一步分析探讨了模态、推理策略、强化学习技术及响应长度对模型表现的影响。我们期待KORGym能成为推动LLM推理研究、开发适应复杂交互环境的评估方法的重要资源。


Toward Reliable Biomedical Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models

Abstract

arXiv:2505.14599v1 Announce Type: cross Abstract: Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful biomedical hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.

摘要

大型语言模型(LLMs)在生物医学等科学领域展现出显著潜力,尤其在假设生成方面能够通过分析海量文献、识别模式并提出研究方向。然而,核心挑战在于评估生成假设的真实性,因为验证其准确性通常需要耗费大量时间和资源。此外,LLMs的幻觉问题可能导致生成看似合理实则错误的假设,削弱其可靠性。为系统研究这些挑战,我们提出TruthHypo基准用于评估LLMs生成真实生物医学假设的能力,并开发基于知识的幻觉检测器KnowHD以评估假设与现有知识的契合度。实验结果表明,LLMs难以生成真实假设。通过分析推理步骤中的幻觉现象,我们证明KnowHD提供的 groundedness 评分可作为有效指标,从LLMs的多样化输出中筛选真实假设。人工评估进一步验证了KnowHD在识别真实假设和加速科学发现方面的实用性。数据与源代码已开源:https://github.com/Teddy-XiongGZ/TruthHypo。


KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models

Abstract

arXiv:2505.14629v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generates recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities, retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at https://github.com/mohbattharani/KERL.

摘要

随着大语言模型(LLMs)的快速发展以及食品数据的日益丰富,利用LLMs提升食品理解的研究逐渐增多。尽管已有多个推荐系统结合了LLMs和知识图谱(KGs),但将食品相关KGs与LLMs整合的研究仍较为有限。我们提出了KERL,一个统一的系统,该系统利用食品KGs和LLMs提供个性化的食品推荐,并生成包含微量营养信息的食谱。给定一个自然语言问题,KERL会提取实体并从KG中检索子图,随后将这些子图作为上下文输入LLM以筛选满足约束条件的食谱。接着,我们的系统会为每个食谱生成烹饪步骤和营养信息。为了评估该方法,我们还通过整理与食谱相关的问题并结合约束条件和个人偏好,开发了一个基准数据集。通过大量实验,我们证明所提出的基于KG增强的LLM显著优于现有方法,为食品推荐、食谱生成和营养分析提供了一个完整且一致的解决方案。我们的代码和基准数据集已公开发布于https://github.com/mohbattharani/KERL。


EmoGist: Efficient In-Context Learning for Visual Emotion Understanding

Abstract

arXiv:2505.14660v1 Announce Type: cross Abstract: In this paper, we introduce EmoGist, a training-free, in-context learning method for performing visual emotion classification with LVLMs. The key intuition of our approach is that context-dependent definitions of emotion labels could allow more accurate predictions of emotions, as the ways in which emotions manifest within images are highly context dependent and nuanced. EmoGist pre-generates multiple explanations of emotion labels by analyzing the clusters of example images belonging to each category. At test time, we retrieve a version of the explanation based on embedding similarity, and feed it to a fast VLM for classification. Through our experiments, we show that EmoGist allows up to 13 points improvement in micro F1 scores with the multi-label Memotion dataset, and up to 8 points in macro F1 in the multi-class FI dataset.

摘要

本文提出EmoGist,一种基于上下文学习的免训练视觉情感分类方法,适用于大型视觉语言模型(LVLMs)。该方法的核心思想是:情感标签的上下文相关定义能够实现更准确的情感预测,因为图像中情感表现具有高度语境依赖性和细微差异。EmoGist通过分析每个情感类别的示例图像聚类,预先生成多种情感标签解释。在测试阶段,我们基于嵌入相似度检索相应解释,并将其输入快速视觉语言模型进行分类。实验表明,在多标签Memotion数据集上,EmoGist使微平均F1分数最高提升13个百分点;在多分类FI数据集上,宏平均F1分数最高提升8个百分点。
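
The test-time retrieval step described above (pick an explanation by embedding similarity, then hand it to the classifier) can be sketched as follows. This is an illustration under assumptions: the bank layout, the use of cosine similarity, the 2-D toy embeddings, and the explanation strings are all hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_explanation(query_embedding, explanation_bank):
    """Return the stored explanation whose embedding is closest to the query."""
    best = max(
        explanation_bank,
        key=lambda item: cosine(query_embedding, item["embedding"]),
    )
    return best["text"]


# Hypothetical bank: one entry per image cluster, built ahead of time.
bank = [
    {"embedding": [1.0, 0.0], "text": "joy: bright scenes with smiling people"},
    {"embedding": [0.0, 1.0], "text": "joy: ironic captions over cheerful art"},
]

# Hypothetical embedding of the test image.
print(retrieve_explanation([0.9, 0.1], bank))
```

The retrieved text would then be placed in the prompt of the vision-language model alongside the test image.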


CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation

Abstract

arXiv:2505.14646v1 Announce Type: cross Abstract: Efficient creation of accurate and editable 3D CAD models is critical in engineering design, significantly impacting cost and time-to-market in product innovation. Current manual workflows remain highly time-consuming and demand extensive user expertise. While recent developments in AI-driven CAD generation show promise, existing models are limited by incomplete representations of CAD operations, inability to generalize to real-world images, and low output accuracy. This paper introduces CAD-Coder, an open-source Vision-Language Model (VLM) explicitly fine-tuned to generate editable CAD code (CadQuery Python) directly from visual input. Leveraging a novel dataset that we created, GenCAD-Code, consisting of over 163k CAD-model image and code pairs, CAD-Coder outperforms state-of-the-art VLM baselines such as GPT-4.5 and Qwen2.5-VL-72B, achieving a 100% valid syntax rate and the highest accuracy in 3D solid similarity. Notably, our VLM demonstrates some signs of generalizability, successfully generating CAD code from real-world images and executing CAD operations unseen during fine-tuning. The performance and adaptability of CAD-Coder highlight the potential of VLMs fine-tuned on code to streamline CAD workflows for engineers and designers. CAD-Coder is publicly available at: https://github.com/anniedoris/CAD-Coder.

摘要

高效创建精确且可编辑的3D CAD模型对工程设计至关重要,这将显著影响产品创新的成本与上市时间。当前人工工作流程仍极为耗时且需要用户具备深厚专业知识。尽管人工智能驱动的CAD生成技术近期发展显示出潜力,但现有模型仍受限于CAD操作表征不完整、难以泛化至真实世界图像以及输出精度较低等问题。本文提出CAD-Coder——一个经过显式微调的开源视觉语言模型(VLM),可直接根据视觉输入生成可编辑的CAD代码(CadQuery Python)。通过利用我们创建的新型数据集GenCAD-Code(包含超过16.3万组CAD模型图像与代码对),CAD-Coder在生成语法有效性(100%通过率)和三维实体相似度准确性方面均超越GPT-4.5、Qwen2.5-VL-72B等最先进的VLM基线模型。值得注意的是,该VLM展现出一定泛化能力,能成功根据真实世界图像生成CAD代码,并执行微调阶段未见的CAD操作。CAD-Coder的性能与适应性凸显了基于代码微调的VLM在简化工程师和设计师CAD工作流程方面的潜力。CAD-Coder已开源发布:https://github.com/anniedoris/CAD-Coder。


Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)

Abstract

arXiv:2505.14608v1 Announce Type: cross Abstract: Despite considerable progress in the development of machine-text detectors, it has been suggested that the problem is inherently hard, and therefore, that stakeholders should proceed under the assumption that machine-generated text cannot be reliably detected as such. We examine a recent such claim by Nicks et al. (2024) regarding the ease with which language models can be optimized to degrade the performance of machine-text detectors, including detectors not specifically optimized against. We identify a feature space (the stylistic feature space) that is robust to such optimization, and show that it may be used to reliably detect samples from language models optimized to prevent detection. Furthermore, we show that even when models are explicitly optimized against stylistic detectors, detection performance remains surprisingly unaffected. We then seek to understand if stylistic detectors are inherently more robust. To study this question, we explore a new paraphrasing approach that simultaneously aims to close the gap between human writing and machine writing in stylistic feature space while avoiding detection using traditional features. We show that when only a single sample is available for detection, this attack is universally effective across all detectors considered, including those that use writing style. However, as the number of samples available for detection grows, the human and machine distributions become distinguishable. This observation encourages us to introduce AURA, a metric that estimates the overlap between human and machine-generated distributions by analyzing how detector performance improves as more samples become available. Overall, our findings underscore previous recommendations to avoid reliance on machine-text detection.

摘要

尽管机器文本检测器的研发已取得显著进展,但有观点认为该问题本质上是困难的,因此利益相关者应基于"机器生成文本无法被可靠检测"的前提开展工作。我们检验了Nicks等人(2024年)的最新主张,该研究认为语言模型可轻易通过优化来降低各类机器文本检测器(包括非针对性优化的检测器)的性能。我们发现了一个对此类优化具有鲁棒性的特征空间——风格特征空间,并证明其可用于可靠检测经过防检测优化的语言模型样本。进一步研究表明,即使模型被明确针对风格检测器进行优化,检测性能仍能保持惊人的稳定性。随后我们探究风格检测器是否具有本质鲁棒性。为此,我们开发了一种新型改写方法,该方法旨在缩小风格特征空间中人类写作与机器写作的差距,同时规避传统特征的检测。实验表明,当仅提供单个检测样本时,该攻击对所有检测器(包括基于写作风格的检测器)均具有普适有效性。但随着检测样本量增加,人类与机器生成的文本分布将变得可区分。这一发现促使我们提出AURA指标,该指标通过分析检测器性能随样本量增加的改善程度,来评估人机文本分布的重叠程度。总体而言,我们的研究结果强化了先前关于避免依赖机器文本检测的建议。


TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

Abstract

arXiv:2505.14625v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem, false negatives, where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods by dynamically identifying potential false negatives and recovering valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.

摘要

强化学习(RL)已成为通过奖励信号优化策略来增强大语言模型(LLMs)推理能力的有力工具。然而,RL的成功依赖于验证器提供的奖励可靠性。本文揭示并分析了一个普遍存在的问题——假阴性,即验证器错误拒绝模型输出的正确答案。通过对Big-Math-RL-Verified数据集的深入研究发现,超过38%的模型生成响应存在假阴性问题,验证器未能识别正确答案。我们通过实证与理论分析表明,这些假阴性会剥夺模型获取信息梯度信号的能力并延缓收敛,从而严重损害RL训练效果。为缓解此问题,我们提出轻量级LLM验证器tinyV,该模块动态识别潜在假阴性并恢复有效响应以生成更精确的奖励估计,从而增强现有基于规则的验证方法。在多个数学推理基准测试中,集成TinyV可使通过率最高提升10%,并显著加快模型收敛速度。本研究不仅揭示了解决验证器假阴性问题的重要性,更为基于RL的LLM微调提供了实用改进方案。代码已开源:https://github.com/uw-nsl/TinyV。
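
The two-stage idea above (a rule-based check first, then a second verifier applied only to rejected answers to recover false negatives) can be sketched as below. This is not the authors' implementation: the second stage in the paper is an LLM, whereas this sketch swaps in a simple numeric-equivalence check so that it runs on its own. The function names are hypothetical.

```python
from fractions import Fraction


def rule_based_verify(answer: str, reference: str) -> bool:
    """Strict string match, standing in for a rule-based verifier."""
    return answer.strip() == reference.strip()


def fallback_verify(answer: str, reference: str) -> bool:
    """Hypothetical stand-in for the LLM-based verifier.

    Assumption: numeric equivalence is used here purely for illustration.
    """
    try:
        return Fraction(answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return False


def reward(answer: str, reference: str) -> float:
    """Return 1.0 for a correct answer, 0.0 otherwise."""
    if rule_based_verify(answer, reference):
        return 1.0
    # Only answers the rule rejected reach the second check, so the
    # extra cost is paid just for potential false negatives.
    return 1.0 if fallback_verify(answer, reference) else 0.0


print(reward("0.5", "0.5"))   # accepted by the rule
print(reward("1/2", "0.5"))   # rejected by the rule, recovered by the fallback
print(reward("0.6", "0.5"))   # rejected by both
```

Running the second check only on rejections keeps the cheap rule-based path for the common case, which matches the abstract's description of the verifier as an augmentation rather than a replacement.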


Beyond Words: Multimodal LLM Knows When to Speak

Abstract

arXiv:2505.14654v1 Announce Type: cross Abstract: While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.

摘要

尽管基于大语言模型(LLM)的聊天机器人在生成连贯且上下文相关的回应方面表现出强大能力,但它们往往难以把握发言时机,特别是在持续对话中需要作出简短及时的反应时。这一局限性主要源于其对文本输入的依赖,缺乏现实人类对话中丰富的多模态情境线索。本研究聚焦于实时预测回应类型,重点关注依赖视觉、听觉和文本等多模态细微信号的简短反应性话语。为此,我们引入了一个从真实对话视频构建的新型多模态数据集,包含时间对齐的视觉、听觉和文本流。该数据集支持对二元互动中回应时机的细粒度建模。基于此数据集,我们提出MM-When2Speak模型——一个基于多模态大语言模型的系统,能自适应整合视觉、听觉和文本上下文来预测回应时机及合适类型。实验表明,MM-When2Speak显著优于当前最先进的单模态及基于LLM的基线模型,在回应时机准确率上较领先商业LLM提升达4倍。这些结果证实了多模态输入对于产生及时、自然且引人入胜的对话AI的关键作用。


Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning

Abstract

arXiv:2505.14684v1 Announce Type: cross Abstract: Large language models (LLMs) have achieved remarkable progress on mathematical tasks through Chain-of-Thought (CoT) reasoning. However, existing mathematical CoT datasets often suffer from Thought Leaps due to experts omitting intermediate steps, which negatively impacts model learning and generalization. We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate missing intermediate reasoning steps to restore the completeness and coherence of CoT. To facilitate this, we constructed a specialized training dataset called ScaleQM+, based on the structured ScaleQuestMath dataset, and trained CoT-Bridge to bridge thought leaps. Through comprehensive experiments on mathematical reasoning benchmarks, we demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets, with improvements of up to +5.87% on NuminaMath. Our approach effectively enhances distilled data (+3.02%) and provides better starting points for reinforcement learning (+3.1%), functioning as a plug-and-play module compatible with existing optimization techniques. Furthermore, CoT-Bridge demonstrates improved generalization to out-of-domain logical reasoning tasks, confirming that enhancing reasoning completeness yields broadly applicable benefits.

摘要

大语言模型(LLMs)通过思维链(CoT)推理在数学任务上取得了显著进展。然而,现有数学CoT数据集常因专家省略中间步骤而出现思维跳跃问题,这对模型学习与泛化产生负面影响。我们提出CoT思维跳跃桥接任务,旨在自动检测跳跃并生成缺失的中间推理步骤,以恢复CoT的完整性与连贯性。为此,我们基于结构化ScaleQuestMath数据集构建了专用训练集ScaleQM+,并训练CoT-Bridge模型来桥接思维跳跃。通过在数学推理基准上的全面实验,我们证明基于桥接数据集微调的模型始终优于原始数据集训练的模型,在NuminaMath上最高提升+5.87%。该方法有效提升了蒸馏数据性能(+3.02%),为强化学习提供更优起点(+3.1%),可作为即插即用模块兼容现有优化技术。此外,CoT-Bridge在领域外逻辑推理任务中展现出更好的泛化能力,证实提升推理完整性具有广泛适用性。


APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking

Abstract

arXiv:2406.14449v2 Announce Type: replace Abstract: Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly reranking, underexplored. Directly applying current prompt engineering algorithms to relevance ranking is challenging due to the integration of query and long passage pairs in the input, where the ranking complexity surpasses classification tasks. To reduce human effort and unlock the potential of prompt optimization in reranking, we introduce a novel automatic prompt engineering algorithm named APEER. APEER iteratively generates refined prompts through feedback and preference optimization. Extensive experiments with four LLMs and ten datasets demonstrate the substantial performance improvement of APEER over existing state-of-the-art (SoTA) manual prompts. Furthermore, we find that the prompts generated by APEER exhibit better transferability across diverse tasks and LLMs.

摘要

大语言模型(LLMs)显著提升了信息检索(IR)各模块(如重排序)的性能。尽管表现优异,当前基于LLM的零样本相关性排序仍高度依赖人工提示工程。现有自动提示工程算法主要集中于语言建模和分类任务,而对IR领域(尤其是重排序)的研究尚未充分展开。由于相关性排序需将查询与长文本对整合输入,其排序复杂度远超分类任务,直接应用现有提示工程算法具有挑战性。为减少人工干预并释放提示优化在重排序中的潜力,我们提出新型自动提示工程算法APEER。该算法通过反馈与偏好优化迭代生成精细化提示。基于四种LLM和十组数据集的实验表明,APEER相较现有最先进(SoTA)人工提示实现了显著性能提升。此外,我们发现APEER生成的提示在不同任务和LLM间展现出更优的迁移性。


Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment

Abstract

arXiv:2410.09083v2 Announce Type: replace Abstract: This paper presents a method to analyze the inference patterns used by Large Language Models (LLMs) for judgment in a case study on legal LLMs, so as to identify potential incorrect representations of the LLM, according to human domain knowledge. Unlike traditional evaluations on language generation results, we propose to evaluate the correctness of the detailed inference patterns of an LLM behind its seemingly correct outputs. To this end, we quantify the interactions between input phrases used by the LLM as primitive inference patterns, because recent theoretical achievements have proven several mathematical guarantees of the faithfulness of the interaction-based explanation. We design a set of metrics to evaluate the detailed inference patterns of LLMs. Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for the legal judgment may represent misleading or irrelevant logic.

摘要

本文提出一种分析方法,用于研究大型语言模型(LLMs)在法律案例判断中的推理模式,从而根据人类领域知识识别模型可能存在的错误表征。与传统基于语言生成结果的评估不同,我们主张对模型在看似正确输出背后所采用的详细推理模式进行正确性评估。为此,我们将模型使用的输入短语间交互量化为基本推理模式,因为最新理论成果已证明基于交互的解释方法具有若干数学上的忠实性保证。我们设计了一套指标来评估LLMs的详细推理模式。实验表明,即使语言生成结果看似正确,模型用于法律判断的推理模式中仍有相当部分可能体现误导性或无关的逻辑。


Abstract

arXiv:2505.14680v1 Announce Type: cross Abstract: Generative AI search is reshaping information retrieval by offering end-to-end answers to complex queries, reducing users' reliance on manually browsing and summarizing multiple web pages. However, while this paradigm enhances convenience, it disrupts the feedback-driven improvement loop that has historically powered the evolution of traditional Web search. Web search engines can continuously improve their ranking models by collecting large-scale, fine-grained user feedback (e.g., clicks, dwell time) at the document level. In contrast, generative AI search operates through a much longer search pipeline, spanning query decomposition, document retrieval, and answer generation, yet typically receives only coarse-grained feedback on the final answer. This introduces a feedback loop disconnect, where user feedback for the final output cannot be effectively mapped back to specific system components, making it difficult to improve each intermediate stage and sustain the feedback loop. In this paper, we envision NExT-Search, a next-generation paradigm designed to reintroduce fine-grained, process-level feedback into generative AI search. NExT-Search integrates two complementary modes: User Debug Mode, which allows engaged users to intervene at key stages; and Shadow User Mode, where a personalized user agent simulates user preferences and provides AI-assisted feedback for less interactive users. Furthermore, we envision how these feedback signals can be leveraged through online adaptation, which refines current search outputs in real-time, and offline update, which aggregates interaction logs to periodically fine-tune query decomposition, retrieval, and generation models. By restoring human control over key stages of the generative AI search pipeline, we believe NExT-Search offers a promising direction for building feedback-rich AI search systems that can evolve continuously alongside human feedback.

摘要

生成式AI搜索正通过提供复杂查询的端到端答案重塑信息检索领域,减少了用户手动浏览和汇总多个网页的需求。然而,这种范式在提升便利性的同时,也破坏了传统网络搜索赖以发展的反馈驱动改进机制。传统网络搜索能够通过收集文档层面的大规模细粒度用户反馈(如点击率、停留时长)持续优化排序模型,而生成式AI搜索则需经历查询分解、文档检索和答案生成等更长的搜索流程,却通常仅能获得最终答案的粗粒度反馈。这种反馈循环的断裂使得最终输出的用户反馈难以有效映射回特定系统组件,导致各中间阶段无法优化、反馈循环难以为继。本文提出NExT-Search这一新一代范式,旨在将细粒度的过程级反馈重新引入生成式AI搜索。该框架整合两种互补模式:用户调试模式允许深度用户介入关键环节;影子用户模式则通过个性化代理模拟用户偏好,为非交互型用户提供AI辅助反馈。进一步地,我们提出可通过在线自适应(实时优化当前搜索结果)和离线更新(聚合交互日志周期性微调查询分解、检索与生成模型)来利用这些反馈信号。通过恢复人类对生成式AI搜索关键环节的掌控,NExT-Search为构建能随人类反馈持续进化的、富含反馈机制的AI搜索系统提供了可行方向。


IoT-LLM: Enhancing Real-World IoT Task Reasoning with Large Language Models

Abstract

arXiv:2410.02429v3 Announce Type: replace Abstract: Large Language Models (LLMs) excel in textual and visual tasks but often produce outputs that defy physical laws when dealing with physical-world reasoning tasks. Inspired by human cognition, where perception is fundamental to reasoning, we explore augmenting LLMs with enhanced perception abilities using Internet of Things (IoT) sensor data and pertinent knowledge for IoT-sensory task reasoning in the physical world. In this work, we systematically study LLMs' capability to address real-world IoT-sensory tasks by augmenting their perception and knowledge base, and then propose a unified framework, IoT-LLM, to enhance such capability. In IoT-LLM, we customize three steps for LLMs: preprocessing IoT data into formats amenable to LLMs, expanding their understanding via IoT-oriented retrieval-augmented generation based on in-context learning and activating their commonsense knowledge through chain-of-thought prompting and specialized role definitions. We design a new benchmark comprising five real-world tasks with varying data types and reasoning complexities to evaluate the performance of IoT-LLM. Experimental results on six LLMs reveal that IoT-LLM significantly improves the performance of IoT-sensory task reasoning of LLMs, with models like GPT-4o-mini showing a 49.4% average improvement over previous methods.

摘要

大语言模型(LLMs)在文本和视觉任务中表现卓越,但在处理物理世界推理任务时,常产生违背物理定律的输出。受人类认知中感知是推理基础的启发,我们探索通过物联网(IoT)传感器数据和相关知识增强LLMs的感知能力,以支持物理世界中的IoT感知任务推理。本研究系统性地考察了LLMs通过增强感知和知识库解决真实世界IoT感知任务的能力,并提出统一框架IoT-LLM来提升该能力。在IoT-LLM中,我们为LLMs定制了三个步骤:将IoT数据预处理为适合LLMs的格式、基于上下文学习的IoT导向检索增强生成扩展其理解能力,以及通过思维链提示和特定角色定义激活其常识知识。我们设计了一个包含五种数据类型和推理复杂度各异的真实世界任务的新基准,用于评估IoT-LLM的性能。在六个LLMs上的实验结果表明,IoT-LLM显著提升了LLMs的IoT感知任务推理性能,其中GPT-4o-mini等模型相比先前方法平均提升了49.4%。


Attention Mechanism for LLM-based Agents Dynamic Diffusion under Information Asymmetry

Abstract

arXiv:2502.13160v3 Announce Type: replace Abstract: Large language models have been used to simulate human society using multi-agent systems. Most current social simulation research emphasizes interactive behaviors in fixed environments, ignoring information opacity, relationship variability, and diffusion diversity. In this paper, we first propose a general framework for exploring multi-agent information diffusion. We identified LLMs' deficiency in the perception and utilization of social relationships, as well as diverse actions. Then, we designed a dynamic attention mechanism to help agents allocate attention to different information, addressing the limitations of the LLM attention mechanism. Agents start by responding to external information stimuli within a five-agent group, increasing group size and forming information circles while developing relationships and sharing information. Additionally, we explore the information diffusion features in the asymmetric open environment by observing the evolution of information gaps, diffusion patterns, and the accumulation of social capital, which are closely linked to psychological, sociological, and communication theories.

摘要

大型语言模型已被用于通过多智能体系统模拟人类社会。当前大多数社会模拟研究侧重于固定环境中的交互行为,忽视了信息不透明性、关系可变性以及扩散多样性。本文首先提出一个探索多智能体信息扩散的通用框架,发现大语言模型在社交关系感知与利用以及多样化行为方面存在不足。为此,我们设计了一种动态注意力机制,帮助智能体对不同信息分配注意力,以解决大语言模型注意力机制的局限性。智能体首先在五智能体群组中响应外部信息刺激,通过扩大群组规模形成信息圈,同时发展关系网络并共享信息。此外,通过观察信息鸿沟的演变、扩散模式以及社会资本的积累——这些与心理学、社会学和传播理论紧密相关的现象,我们探索了非对称开放环境中的信息扩散特征。


Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning

Abstract

arXiv:2504.05047v2 Announce Type: replace Abstract: Multiagent collaboration has emerged as a promising framework for enhancing the reasoning capabilities of large language models (LLMs). Despite improvements in reasoning, the approach introduces substantial computational overhead resulting from iterative agent interactions. Furthermore, engaging in unnecessary debates increases the risk of generating erroneous responses. To address these challenges, we propose Debate Only When Necessary (DOWN), an adaptive multiagent debate framework that selectively activates debate based on the confidence score of the agent's initial response. Debate is activated only for queries requiring further deliberation, during which agents refine their outputs by referencing peer responses and associated confidence scores. Evaluations on benchmarks show that DOWN improves efficiency by up to six times while preserving or even outperforming the performance of existing methods. Further analysis indicates that DOWN effectively mitigates the risk of error propagation stemming from the unnecessary debate process. These findings demonstrate the effectiveness of our approach in delivering high-performance LLM solutions at a lower computational cost.

摘要

多智能体协作已成为增强大语言模型(LLM)推理能力的重要框架。尽管该方法提升了推理性能,但迭代式的智能体交互会带来显著的计算开销。此外,不必要的辩论会增加生成错误回答的风险。为解决这些问题,我们提出"仅在必要时辩论"(DOWN)——一种基于智能体初始回答置信度分数选择性激活辩论的自适应多智能体辩论框架。该框架仅对需要进一步审议的查询激活辩论,期间智能体通过参考同伴回答及相关置信度分数来优化输出。基准测试表明,DOWN在保持甚至超越现有方法性能的同时,将效率提升至多六倍。进一步分析显示,DOWN能有效减少由不必要辩论过程导致的错误传播风险。这些发现证明我们的方法能够以更低计算成本实现高性能LLM解决方案。
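
The gating logic described above (debate only when the initial response's confidence is low) can be sketched as follows. The threshold value, the agent interface, the number of rounds, and the rule of returning the highest-confidence response after debate are all assumptions made for illustration; the abstract specifies only that debate is triggered by the initial confidence score and that agents consult peer responses and their confidence scores.

```python
from typing import Callable, List, Tuple

Response = Tuple[str, float]  # (answer, confidence in [0, 1])
Agent = Callable[[str, List[Response]], Response]


def answer_with_optional_debate(
    query: str,
    agents: List[Agent],
    threshold: float = 0.8,  # hypothetical confidence cut-off
    rounds: int = 1,
) -> Tuple[Response, bool]:
    """Return the final response and whether a debate was triggered."""
    initial = agents[0](query, [])
    if initial[1] >= threshold:
        # Confident enough: skip the debate and its extra model calls.
        return initial, False

    # Low confidence: gather peer responses, then let every agent refine
    # its output after seeing the others' answers and confidence scores.
    responses = [initial] + [agent(query, []) for agent in agents[1:]]
    for _ in range(rounds):
        responses = [agent(query, responses) for agent in agents]
    return max(responses, key=lambda r: r[1]), True


# Toy agents standing in for LLM calls (hypothetical behaviour).
def confident_agent(query: str, peers: List[Response]) -> Response:
    return ("4", 0.95)


def unsure_agent(query: str, peers: List[Response]) -> Response:
    if any(conf >= 0.9 for _, conf in peers):
        best_answer = max(peers, key=lambda r: r[1])[0]
        return (best_answer, 0.9)
    return ("5", 0.4)


print(answer_with_optional_debate("2 + 2 = ?", [confident_agent, unsure_agent]))
print(answer_with_optional_debate("2 + 2 = ?", [unsure_agent, confident_agent]))
```

In the first call the debate is skipped entirely, which is where the efficiency gain reported in the abstract would come from.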


From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios

Abstract

arXiv:2502.02145v2 Announce Type: replace Abstract: Ensuring the safety of autonomous vehicles requires virtual scenario-based testing, which depends on the robust evaluation and generation of safety-critical scenarios. So far, researchers have used scenario-based testing frameworks that rely heavily on handcrafted scenarios as safety metrics. To reduce the effort of human interpretation and overcome the limited scalability of these approaches, we combine Large Language Models (LLMs) with structured scenario parsing and prompt engineering to automatically evaluate and generate safety-critical driving scenarios. We introduce Cartesian and Ego-centric prompt strategies for scenario evaluation, and an adversarial generation module that modifies trajectories of risk-inducing vehicles (ego-attackers) to create critical scenarios. We validate our approach using a 2D simulation framework and multiple pre-trained LLMs. The results show that the evaluation module effectively detects collision scenarios and infers scenario safety. Meanwhile, the new generation module identifies high-risk agents and synthesizes realistic, safety-critical scenarios. We conclude that an LLM equipped with domain-informed prompting techniques can effectively evaluate and generate safety-critical driving scenarios, reducing dependence on handcrafted metrics. We release our open-source code and scenarios at: https://github.com/TUM-AVS/From-Words-to-Collisions.

摘要

确保自动驾驶车辆的安全性需要基于虚拟场景的测试,这依赖于对安全关键场景的稳健评估与生成。目前,研究者主要依赖手工构建场景作为安全指标的测试框架。为减少人工干预成本并克服此类方法可扩展性不足的问题,我们结合大型语言模型(LLMs)与结构化场景解析及提示工程技术,实现安全关键驾驶场景的自动评估与生成。我们提出了笛卡尔坐标系和自我中心视角两种提示策略用于场景评估,并开发了一个对抗生成模块,通过修改风险诱发车辆(自我攻击者)的轨迹来创建关键场景。采用二维仿真框架和多种预训练LLMs进行验证,结果表明:评估模块能有效检测碰撞场景并推断场景安全性;同时,新的生成模块可识别高风险智能体并合成逼真的安全关键场景。研究表明,配备领域知识提示技术的LLM能有效评估和生成安全关键驾驶场景,降低对手工指标的依赖。开源代码与场景发布于:https://github.com/TUM-AVS/From-Words-to-Collisions。


ProcessBench: Identifying Process Errors in Mathematical Reasoning

Abstract

arXiv:2412.06559v3 Announce Type: replace Abstract: As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.

摘要

由于语言模型在解决数学问题时经常出错,自动识别推理过程中的错误对其可扩展监督变得愈发重要。本文提出ProcessBench,用于评估识别数学推理中错误步骤的能力。该基准包含3,400个测试案例,主要针对竞赛和奥林匹克级别的数学问题。每个测试案例包含由专家标注错误位置的逐步解答,要求模型识别最早出现错误的步骤或判定所有步骤正确。我们对ProcessBench进行了广泛评估,涉及两类模型:过程奖励模型(PRMs)和评判模型(后者通过提示通用语言模型逐步分析解答)。主要发现如下:(1)现有PRMs通常难以推广至GSM8K和MATH数据集之外的更具挑战性的数学问题,其表现既逊于评判模型(即经提示的通用语言模型),也不及我们在PRM800K数据集上直接微调的PRM;(2)最佳开源模型QwQ-32B-Preview展现出与专有模型GPT-4o相当的评判能力,但仍落后于专攻推理的o1-mini模型。我们希望ProcessBench能推动推理过程评估的研究,为语言模型的可扩展监督铺平道路。
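The task format above (predict the earliest erroneous step, or declare the whole solution correct) lends itself to a simple exact-match scorer. The sketch below is an assumption-laden illustration: the `-1` sentinel for "all steps correct", the function name, and the choice to report accuracy separately on erroneous and fully correct solutions are not taken from the abstract.

```python
from typing import Dict, List

ALL_CORRECT = -1  # assumed sentinel meaning "no step contains an error"


def score_predictions(labels: List[int], predictions: List[int]) -> Dict[str, float]:
    """Exact-match accuracy, split by whether the gold solution has an error.

    labels[i] is the index of the earliest erroneous step, or ALL_CORRECT.
    predictions[i] uses the same convention.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    err_hits = err_total = ok_hits = ok_total = 0
    for gold, pred in zip(labels, predictions):
        if gold == ALL_CORRECT:
            ok_total += 1
            ok_hits += int(pred == gold)
        else:
            err_total += 1
            # Only the earliest erroneous step counts as a correct prediction.
            err_hits += int(pred == gold)

    return {
        "acc_erroneous": err_hits / err_total if err_total else 0.0,
        "acc_correct": ok_hits / ok_total if ok_total else 0.0,
    }
```

Reporting the two accuracies separately guards against a degenerate model that always answers "all correct" or always flags the first step.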


Don't Half-listen: Capturing Key-part Information in Continual Instruction Tuning

Abstract

arXiv:2403.10056v2 Announce Type: replace-cross Abstract: Instruction tuning for large language models (LLMs) can drive them to produce results consistent with human goals in specific downstream tasks. However, the process of continual instruction tuning (CIT) for LLMs may bring about the catastrophic forgetting (CF) problem, where previously learned abilities are degraded. Recent methods try to alleviate the CF problem by modifying models or replaying data, which may only remember the surface-level pattern of instructions and get confused on held-out tasks. In this paper, we propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG). Our method computes the information gain on masked parts to dynamically replay data and refine the training objective, which enables LLMs to capture task-aware information relevant to the correct response and alleviate overfitting to general descriptions in instructions. In addition, we propose two metrics, P-score and V-score, to measure the generalization and instruction-following abilities of LLMs. Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.

摘要

大型语言模型(LLMs)的指令微调可驱动其在特定下游任务中生成符合人类目标的结果。然而,持续指令微调(CIT)过程可能引发灾难性遗忘(CF)问题,导致已习得能力退化。现有方法通常通过修改模型或重放数据来缓解CF问题,但这类方法可能仅记住指令的表层模式,在保留任务上表现混乱。本文提出一种基于关键部分信息增益(KPIG)的新型持续指令微调方法。该方法通过计算掩码部分的信息增益来动态重放数据并优化训练目标,使LLMs能够捕捉与正确响应相关的任务感知信息,缓解对指令中通用描述的过拟合。此外,我们提出P-score和V-score两项指标,用于衡量LLMs的泛化能力和指令遵循能力。实验表明,本方法在已见任务和保留任务上均取得优越性能。


KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

Abstract

arXiv:2504.15364v3 Announce Type: replace Abstract: We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity with attention scores. These results imply KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families by observing a performance gap of less than 0.04% with 8K cache budget (~23% KV cache reduction) from the non-evicting baseline on LongBench for Llama 3.1-8B and Llama 3.2-3B. We also observe near baseline performance for Deepseek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and decrease end-to-end inference latency by up to 30% compared to the other token-eviction methods.

摘要

我们研究发现,在大语言模型推理过程中具有几何区分度的键往往具有较高的注意力分数。基于这一现象,我们提出了KeyDiff——一种仅基于键相似性的免训练KV缓存淘汰方法。与其它KV缓存淘汰方法不同,KeyDiff能在严格资源限制下处理任意长提示词,并高效生成响应。我们通过建立键多样性与注意力分数之间的关联,为KeyDiff提供了理论基础。这些结果表明KeyDiff能有效识别需要保留的最重要标记。值得注意的是,KeyDiff不依赖注意力分数,因此可采用FlashAttention等优化注意力机制。在严格内存限制下,我们在Llama和Qwen模型系列上验证了KeyDiff的有效性:Llama 3.1-8B和Llama 3.2-3B在LongBench基准测试中,使用8K缓存预算(约减少23% KV缓存)时与非淘汰基线相比性能差距小于0.04%。在Math500推理基准测试中,Deepseek-R1-Distill-Llama-8B模型保持了接近基线的性能,且与其他标记淘汰方法相比,端到端推理延迟最高降低30%。
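One way to picture eviction "based solely on key similarity" is to score each cached key by its cosine similarity to a reference direction and keep the least similar (most distinctive) ones. The sketch below assumes the mean of the cached keys as that reference; the anchor choice, the function names, and the plain-list representation are illustrative assumptions, not details confirmed by the abstract.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_keys_to_keep(keys: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` most distinctive keys, in original order.

    Distinctiveness is taken here (as an assumption) to mean low cosine
    similarity to the mean key. No attention scores are needed.
    """
    if budget >= len(keys):
        return list(range(len(keys)))

    dim = len(keys[0])
    mean_key = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    sims = [_cosine(k, mean_key) for k in keys]

    # Lowest similarity first: these are the keys that stand out geometrically.
    ranked = sorted(range(len(keys)), key=lambda i: sims[i])
    return sorted(ranked[:budget])
```

Because the score depends only on the keys, a selection of this kind can run alongside fused attention kernels that never materialize the attention matrix.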


OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

Abstract

arXiv:2409.13652v3 Announce Type: replace-cross Abstract: The recent paradigm shift to large-scale foundation models has brought about a new era for deep learning that, while it has found great success in practice, has also been plagued by prohibitively expensive costs in terms of high memory consumption and compute. To mitigate these issues, there has been a concerted effort in post-hoc neural network pruning techniques that do not require costly retraining. Despite the considerable progress being made, existing methods often exhibit a steady drop in model performance as the compression increases. In this paper, we present a novel approach to compressing large transformers, coined OATS, that utilizes the second moment information in the input embeddings to decompose the model weights into a sum of sparse and low-rank matrices. Without any retraining, OATS achieves state-of-the-art performance when compressing models by up to 60% on large language models such as Llama-3 and Phi-3 and vision transformers such as ViT and DINOv2 while delivering up to 1.37× the CPU acceleration versus a model that was comparably pruned.

摘要

近年来向大规模基础模型的范式转变开启了深度学习的新纪元,尽管在实践中取得了巨大成功,但也面临着高昂内存消耗和计算成本的问题。为缓解这些问题,学术界集中研究了无需昂贵重新训练的事后神经网络剪枝技术。尽管已取得显著进展,但现有方法在压缩率提高时往往伴随模型性能的持续下降。本文提出了一种名为OATS的新型大模型压缩方法,该方法利用输入嵌入中的二阶矩信息将模型权重分解为稀疏矩阵和低秩矩阵之和。在不进行任何重新训练的情况下,OATS在Llama-3、Phi-3等大型语言模型以及ViT、DINOv2等视觉Transformer上实现了高达60%的压缩率,同时相较于同类剪枝模型可获得1.37倍的CPU加速效果,其性能达到了当前最先进水平。


PersonaGym: Evaluating Persona Agents and LLMs

Abstract

arXiv:2407.18416v4 Announce Type: replace-cross Abstract: Persona agents, which are LLM agents conditioned to act according to an assigned persona, enable contextually rich and user aligned interactions across domains like education and healthcare. However, evaluating how faithfully these agents adhere to their personas remains a significant challenge, particularly in free-form settings that demand consistency across diverse, persona-relevant environments. We introduce PersonaGym, the first dynamic evaluation framework for persona agents, and PersonaScore, a human-aligned automatic metric grounded in decision theory that enables comprehensive large-scale evaluation. Our evaluation of 10 leading LLMs across 200 personas and 10,000 questions reveals significant advancement opportunities. For example, GPT-4.1 had the exact same PersonaScore as LLaMA-3-8b despite being a more recent and advanced closed source model. Importantly, increased model size and complexity do not necessarily enhance persona agent capabilities, underscoring the need for algorithmic and architectural innovation toward faithful, performant persona agents.

摘要

角色代理(即根据指定角色设定调整行为的大语言模型代理)能够在教育和医疗等领域实现情境丰富且贴合用户需求的交互。然而,如何准确评估这些代理对角色设定的遵循程度仍存在重大挑战,尤其是在需要跨多样角色相关环境保持一致性的自由交互场景中。我们提出PersonaGym——首个动态角色代理评估框架,以及基于决策理论、与人类评估对齐的自动化指标PersonaScore,该体系支持大规模综合评估。通过对10个主流大语言模型在200个角色设定和10,000个问题上的测试,我们发现存在显著改进空间:例如GPT-4.1与LLaMA-3-8b的PersonaScore得分完全相同,尽管前者是更新且更先进的闭源模型。值得注意的是,模型规模和复杂度的提升并不必然增强角色代理能力,这凸显了需要通过算法和架构创新来实现忠实且高效的角色代理。


Exploring Social Media Image Categorization Using Large Models with Different Adaptation Methods: A Case Study on Cultural Nature's Contributions to People

Abstract

arXiv:2410.00275v3 Announce Type: replace-cross Abstract: Social media images provide valuable insights for modeling, mapping, and understanding human interactions with natural and cultural heritage. However, categorizing these images into semantically meaningful groups remains highly complex due to the vast diversity and heterogeneity of their visual content, as they contain open-world human and natural elements. This challenge becomes greater when categories involve abstract concepts and lack consistent visual patterns. Related studies involve human supervision in the categorization process, and the lack of public benchmark datasets makes comparisons between these works unfeasible. On the other hand, the continuous advances in large models, including Large Language Models (LLMs), Large Visual Models (LVMs), and Large Visual Language Models (LVLMs), provide a large space of unexplored solutions. In this work, 1) we introduce FLIPS, a dataset of Flickr images that capture the interaction between human and nature, and 2) we evaluate various solutions based on different types and combinations of large models using various adaptation methods. We assess and report their performance in terms of cost, productivity, scalability, and result quality to address the challenges of social media image categorization.

摘要

社交媒体图像为建模、制图及理解人类与自然和文化遗产的互动提供了宝贵资源。然而,由于这些图像包含开放世界的人类与自然元素,其视觉内容具有高度多样性和异质性,将其分类为具有语义意义的组别仍极具挑战性。当涉及抽象概念且缺乏一致视觉模式的类别时,这一挑战尤为显著。现有研究多在分类过程中依赖人工监督,且缺乏公开基准数据集,导致无法有效比较不同研究成果。另一方面,大型模型(包括大语言模型、大视觉模型及大视觉语言模型)的持续进展为解决该问题提供了广阔的探索空间。本研究:1)提出了FLIPS数据集,包含反映人与自然互动的Flickr图像;2)采用多种适配方法评估了基于不同类型大模型组合的解决方案。我们从成本、效率、可扩展性和结果质量等维度系统评估并报告了这些方案在社交媒体图像分类挑战中的表现。


Learning from Committee: Reasoning Distillation from a Mixture of Teachers with Peer-Review

Abstract

arXiv:2410.03663v4 Announce Type: replace-cross Abstract: While reasoning capabilities typically emerge in large language models (LLMs) with tens of billions of parameters, recent research focuses on improving smaller open-source models through knowledge distillation (KD) from commercial LLMs. However, many of these studies rely solely on responses from a single LLM as the gold rationale, unlike the natural human learning process, which involves understanding both the correct answers and the reasons behind mistakes. In this paper, we introduce a novel Fault-Aware DistIllation via Peer-Review (FAIR) approach: 1) instead of merely obtaining rationales from teachers, our method asks teachers to identify and explain the student's mistakes, providing customized instruction learning data; 2) we design a simulated peer-review process between teacher LLMs, which selects only the generated rationales above the acceptance threshold. This reduces the chance of teachers guessing correctly with a flawed rationale and improves instructional data quality. Comprehensive experiments and analysis on mathematical, commonsense, and logical reasoning tasks demonstrate the effectiveness of our method. Our code is available at https://github.com/zhuochunli/Learn-from-Committee.

摘要

虽然推理能力通常出现在具有数百亿参数的大型语言模型(LLMs)中,但近期研究致力于通过从商用LLMs进行知识蒸馏(KD)来改进较小的开源模型。然而,与人类自然学习过程不同——后者需要同时理解正确答案和错误背后的原因——许多研究仅依赖单一LLM的响应作为黄金依据。本文提出了一种新颖的'基于同行评审的容错蒸馏'(FAIR)方法:1)不同于仅从教师模型获取依据,我们的方法要求教师模型识别并解释学生的错误,从而提供定制化的指令学习数据;2)我们设计了教师LLM之间的模拟同行评审流程,仅筛选超过接受阈值的生成依据,这降低了教师模型通过错误依据猜测正确的可能性,提升了教学数据质量。在数学、常识和逻辑推理任务上的全面实验与分析证明了本方法的有效性。代码发布于https://github.com/zhuochunli/Learn-from-Committee。
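The peer-review step in point 2) can be illustrated as a filter: each teacher's rationale is scored by the other teachers and kept only if it clears an acceptance threshold. The sketch below is hypothetical throughout: the `review` callable, the use of a mean score, and the default threshold are assumptions chosen for illustration, since the abstract does not specify how reviews are aggregated.

```python
from typing import Callable, Dict, List


def peer_review_filter(
    rationales: Dict[str, str],            # teacher name -> rationale it produced
    review: Callable[[str, str], float],   # (reviewer name, rationale) -> score in [0, 1]
    accept_threshold: float = 0.7,         # hypothetical acceptance threshold
) -> List[str]:
    """Keep rationales whose mean score from the *other* teachers is high enough."""
    accepted = []
    for author, rationale in rationales.items():
        # A teacher never reviews its own rationale.
        scores = [
            review(reviewer, rationale)
            for reviewer in rationales
            if reviewer != author
        ]
        if scores and sum(scores) / len(scores) >= accept_threshold:
            accepted.append(rationale)
    return accepted
```

The point of excluding self-review is that a teacher who reached the right answer through a flawed rationale still has to convince its peers before the rationale enters the training data.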


Automating Intervention Discovery from Scientific Literature: A Progressive Ontology Prompting and Dual-LLM Framework

Abstract

arXiv:2409.00054v2 Announce Type: replace-cross Abstract: Identifying effective interventions from the scientific literature is challenging due to the high volume of publications, specialized terminology, and inconsistent reporting formats, making manual curation laborious and prone to oversight. To address this challenge, this paper proposes a novel framework leveraging large language models (LLMs), which integrates a progressive ontology prompting (POP) algorithm with a dual-agent system, named LLM-Duo. On the one hand, the POP algorithm conducts a prioritized breadth-first search (BFS) across a predefined ontology, generating structured prompt templates and action sequences to guide the automatic annotation process. On the other hand, the LLM-Duo system features two specialized LLM agents, an explorer and an evaluator, working collaboratively and adversarially to continuously refine annotation quality. We showcase the real-world applicability of our framework through a case study focused on speech-language intervention discovery. Experimental results show that our approach surpasses advanced baselines, achieving more accurate and comprehensive annotations through a fully automated process. Our approach successfully identified 2,421 interventions from a corpus of 64,177 research articles in the speech-language pathology domain, culminating in the creation of a publicly accessible intervention knowledge base with great potential to benefit the speech-language pathology community.

摘要

由于科学文献数量庞大、术语专业且报告格式不一致,从海量出版物中识别有效干预措施具有挑战性,这使得人工整理工作繁重且容易遗漏。为解决这一问题,本文提出了一种基于大语言模型(LLM)的新型框架LLM-Duo,该框架将渐进式本体提示(POP)算法与双智能体系统相结合。一方面,POP算法通过预定义本体进行优先级广度优先搜索(BFS),生成结构化提示模板和动作序列以指导自动标注流程;另一方面,LLM-Duo系统包含探索者和评估者两个专用LLM智能体,通过协作对抗机制持续优化标注质量。我们以言语-语言干预措施发现为案例验证了该框架的实际适用性。实验结果表明,本方法在完全自动化流程中实现了比先进基线模型更准确、更全面的标注效果。该方法成功从言语病理学领域的64,177篇研究文章中识别出2,421种干预措施,最终构建了一个具有重要应用价值的公开干预措施知识库,有望为言语病理学界带来显著效益。


Reward Guidance for Reinforcement Learning Tasks Based on Large Language Models: The LMGT Framework

Abstract

arXiv:2409.04744v3 Announce Type: replace-cross Abstract: The inherent uncertainty in the environmental transition model of Reinforcement Learning (RL) necessitates a delicate balance between exploration and exploitation. This balance is crucial for optimizing computational resources to accurately estimate expected rewards for the agent. In scenarios with sparse rewards, such as robotic control systems, achieving this balance is particularly challenging. However, given that many environments possess extensive prior knowledge, learning from the ground up in such contexts may be redundant. To address this issue, we propose Language Model Guided reward Tuning (LMGT), a novel, sample-efficient framework. LMGT leverages the comprehensive prior knowledge embedded in Large Language Models (LLMs) and their proficiency in processing non-standard data forms, such as wiki tutorials. By utilizing LLM-guided reward shifts, LMGT adeptly balances exploration and exploitation, thereby guiding the agent's exploratory behavior and enhancing sample efficiency. We have rigorously evaluated LMGT across various RL tasks and evaluated it in the embodied robotic environment Housekeep. Our results demonstrate that LMGT consistently outperforms baseline methods. Furthermore, the findings suggest that our framework can substantially reduce the computational resources required during the RL training phase.

摘要

强化学习(RL)环境转移模型固有的不确定性要求对探索与利用进行精细平衡,这对优化计算资源以准确估计智能体的预期回报至关重要。在机器人控制系统等稀疏奖励场景中,实现这种平衡尤为困难。然而,鉴于许多环境具有丰富的先验知识,在此类场景中从零开始学习可能存在冗余。为此,我们提出了一种新型高样本效率框架——语言模型引导奖励调优(LMGT)。该框架充分利用大型语言模型(LLMs)中嵌入的全面先验知识及其处理非标准数据(如维基教程)的能力,通过LLM引导的奖励偏移机制,有效平衡探索与利用,从而指导智能体的探索行为并提升样本效率。我们在多种RL任务中对LMGT进行了严格测试,并在具身机器人环境Housekeep中完成评估。结果表明,LMGT始终优于基线方法。此外,研究发现我们的框架能显著减少RL训练阶段所需的计算资源。
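To show where an LLM-guided reward shift could enter a learning loop, here is a minimal tabular Q-learning sketch. It is an illustration under stated assumptions rather than the LMGT implementation: the `llm_shift` callable stands in for an LLM that scores a state-action pair, and the integer shift values and function names are hypothetical.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def shifted_reward(
    env_reward: float,
    state: Hashable,
    action: Hashable,
    llm_shift: Callable[[Hashable, Hashable], float],  # hypothetical LLM scorer
) -> float:
    """Environment reward plus a shift reflecting the LLM's prior knowledge."""
    return env_reward + llm_shift(state, action)


def q_update(
    q: QTable,
    state: Hashable,
    action: Hashable,
    reward: float,
    next_state: Hashable,
    actions: Sequence[Hashable],
    alpha: float = 0.5,   # learning rate
    gamma: float = 0.9,   # discount factor
) -> None:
    """Standard one-step Q-learning update, applied in place."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```

With sparse environment rewards, the shift gives the agent a learning signal on steps where `env_reward` is zero, which is how prior knowledge can steer exploration.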


Large Continual Instruction Assistant

Abstract

arXiv:2410.10868v4 Announce Type: replace-cross Abstract: Continual Instruction Tuning (CIT) is adopted to continually instruct Large Models to follow human intent, dataset by dataset. It is observed that existing gradient updates heavily degrade performance on previous datasets during the CIT process. In contrast, Exponential Moving Average (EMA) can trace previous parameters, which helps reduce forgetting. Nonetheless, its fixed balance weight cannot cope with ever-changing datasets, leading to an imbalance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address the challenge. Starting from the trade-off prerequisite and the EMA update, we propose an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be automatically determined by the gradients and learned parameters. Therefore, we propose a stability-plasticity balanced coefficient to avoid knowledge interference. Based on the semantic similarity of the instructions, we can determine whether to retrain or expand the training parameters and allocate the most suitable parameters for the testing instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. Our code is available at https://github.com/JingyangQiao/CoIN.

摘要

持续指令微调(CIT)旨在通过数据逐步指导大模型遵循人类意图数据。研究发现,现有梯度更新方法在CIT过程中会严重破坏模型在先前数据集上的性能。而指数移动平均(EMA)因其追踪历史参数的特性,能够有效缓解遗忘问题。然而,其固定的平衡权重无法适应持续变化的数据集,导致可塑性与稳定性失衡。本文提出通用持续指令微调框架以解决该挑战。从权衡前提与EMA更新出发,我们建立了可塑性与稳定性的理想条件。基于损失函数的泰勒展开,发现最优平衡权重可通过梯度和学习参数自动确定。因此,我们提出稳定-可塑性平衡系数以避免知识干扰。根据指令的语义相似度,可判定是否重训练或扩展训练参数,并为测试实例分配最适配参数。在多组持续指令微调基准测试中,本方法不仅增强了抗遗忘能力,还显著提升了整体持续微调性能。代码已开源:https://github.com/JingyangQiao/CoIN。
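The stability-plasticity trade-off is easiest to see in the plain EMA update that the abstract takes as its starting point. The sketch below shows only that standard update with a fixed weight `beta`; the adaptive coefficient derived from gradients in the paper is not reproduced here, and the function name and list-of-floats parameter representation are assumptions.

```python
from typing import List


def ema_update(ema_params: List[float], new_params: List[float], beta: float) -> List[float]:
    """Standard exponential moving average of parameters.

    beta close to 1 keeps the old parameters (stability);
    beta close to 0 follows the newly trained parameters (plasticity).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * old + (1.0 - beta) * new for old, new in zip(ema_params, new_params)]


# A fixed beta applies the same trade-off to every dataset in the sequence,
# which is the limitation the abstract points out.
stable = ema_update([1.0, 0.0], [0.0, 1.0], beta=0.9)   # stays near the old parameters
plastic = ema_update([1.0, 0.0], [0.0, 1.0], beta=0.1)  # moves toward the new ones
```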


RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals

Abstract

arXiv:2410.11348v3 Announce Type: replace-cross Abstract: Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are actually rewarding. In this paper we develop Rewrite-based Attribute Treatment Estimator (RATE) as an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses to produce imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.

摘要

在对齐或评估大语言模型(LLM)时,奖励模型常被用作人类偏好的代理。然而,奖励模型是黑箱系统,其具体奖励机制往往不明确。本文提出基于重写的属性处理估计器(RATE),作为测量奖励模型对响应高层次属性(如情感、帮助性或复杂性)敏感度的有效方法。关键在于,RATE测量的是属性对奖励的因果效应。该方法利用LLM重写响应以生成不完美的反事实样本,进而量化因果效应。核心挑战在于这些重写存在缺陷,可能导致对奖励模型属性敏感度的估计产生显著偏差。RATE的核心思想是通过双重重写调整这种不完美重写效应。我们验证了RATE程序的有效性,并通过实证表明其作为估计量的优越性。
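A toy model makes the "rewrite twice" idea concrete. Suppose every rewrite leaves a side effect that the reward model also reacts to. Comparing an original response with its single rewrite then mixes the attribute effect with the rewrite artifact, whereas comparing a rewrite with a rewrite-of-the-rewrite puts the artifact on both sides so it cancels. The sketch below is an assumed reading of the procedure: the function names, the callable signatures, and the restriction to responses that already have the attribute are illustrative choices.

```python
from statistics import mean
from typing import Callable, Sequence, TypeVar

R = TypeVar("R")  # a response, in whatever representation the caller uses


def single_rewrite_effect(
    treated: Sequence[R],
    reward: Callable[[R], float],
    rewrite: Callable[[R, int], R],  # hypothetical: rewrite(x, target) sets attribute to target
) -> float:
    """Naive estimate: original (attribute on) vs. one rewrite (attribute off)."""
    return mean([reward(x) - reward(rewrite(x, 0)) for x in treated])


def double_rewrite_effect(
    treated: Sequence[R],
    reward: Callable[[R], float],
    rewrite: Callable[[R, int], R],
) -> float:
    """Compare two rewritten texts so that rewrite artifacts appear on both sides."""
    diffs = []
    for x in treated:
        off = rewrite(x, 0)        # attribute removed (carries rewrite artifacts)
        back_on = rewrite(off, 1)  # attribute restored (also carries rewrite artifacts)
        diffs.append(reward(back_on) - reward(off))
    return mean(diffs)


# Toy setup: a response is (attribute, was_rewritten). The reward pays 2.0 for the
# attribute and, spuriously, 0.5 for rewrite artifacts.
toy_reward = lambda x: 2.0 * x[0] + 0.5 * x[1]
toy_rewrite = lambda x, target: (target, 1)
responses = [(1, 0), (1, 0)]
```

In this toy setup the true attribute effect is 2.0: the single-rewrite estimate comes out at 1.5 because the artifact bonus contaminates one side only, while the double-rewrite estimate recovers 2.0.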


Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation

Abstract

arXiv:2410.14425v2 Announce Type: replace-cross Abstract: Parameter-efficient fine-tuning (PEFT) can bridge the gap between large language models (LLMs) and downstream tasks. However, PEFT has been proven vulnerable to malicious attacks. Research indicates that poisoned LLMs, even after PEFT, retain the capability to activate internalized backdoors when input samples contain predefined triggers. In this paper, we introduce a novel weak-to-strong unlearning algorithm to defend against backdoor attacks based on feature alignment knowledge distillation, named W2SDefense. Specifically, we first train a small-scale language model through full-parameter fine-tuning to serve as the clean teacher model. Then, this teacher model guides the large-scale poisoned student model in unlearning the backdoor, leveraging PEFT. Theoretical analysis suggests that W2SDefense has the potential to enhance the student model's ability to unlearn backdoor features, preventing the activation of the backdoor. We conduct comprehensive experiments on three state-of-the-art large language models and several different backdoor attack algorithms. Our empirical results demonstrate the outstanding performance of W2SDefense in defending against backdoor attacks without compromising model performance.

摘要

参数高效微调(PEFT)能够弥合大语言模型(LLMs)与下游任务之间的鸿沟。然而研究表明,PEFT易受恶意攻击影响,即使经过参数微调,中毒的大语言模型在输入样本包含预定义触发器时仍能激活内嵌后门。本文提出一种基于特征对齐知识蒸馏的弱到强遗忘算法W2SDefense以防御后门攻击。具体而言,我们首先通过全参数微调训练小规模语言模型作为干净的教师模型,随后该教师模型引导大规模中毒学生模型利用PEFT实现后门遗忘。理论分析表明,W2SDefense能增强学生模型遗忘后门特征的能力,从而阻止后门激活。我们在三种前沿大语言模型和多种后门攻击算法上进行了全面实验,实证结果表明W2SDefense在不影响模型性能的前提下,具有卓越的后门攻击防御表现。


Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study

Abstract

arXiv:2410.17980v2 Announce Type: replace-cross Abstract: The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases to account for token order. But current methods still face length generalisation challenges. We investigate an alternative attention mechanism based on the stick-breaking process in larger scale settings. The method works as follows: For each token before the current, we determine a break point, which represents the proportion of the stick, the weight of the attention, to allocate to the current token. We repeat this on the remaining stick, until all tokens are allocated a weight, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing. We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss implementation of numerically stable stick-breaking attention and adapt Flash Attention to accommodate this mechanism. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods on length generalisation and downstream tasks. Stick-breaking also performs well at length generalisation, allowing a model trained with a 2^{11} context window to perform well at 2^{14} with perplexity improvements.

摘要

传统自注意力机制通常依赖softmax算子,因此需要引入RoPE等位置嵌入或位置偏置来表征词元顺序。但现有方法仍面临长度泛化挑战。本研究在大规模场景下探索了一种基于断棒过程的替代注意力机制。该方法运作原理如下:对于当前词元之前的每个词元,我们确定一个断点,该点代表分配给当前词元的注意力权重比例(即'棒'的截取部分)。在剩余棒段上重复此过程,直至所有词元均获得权重分配,最终生成注意力权重序列。该过程天然包含近因偏置特性,这对语法解析具有语言学意义。我们系统研究了用断棒注意力替代传统softmax注意力的影响,继而论述了数值稳定的断棒注意力实现方案,并改造Flash Attention以适配该机制。当作为现有softmax+RoPE注意力系统的直接替代方案时,发现断棒注意力在长度泛化和下游任务中具有与传统方法相当的竞争力。断棒机制在长度泛化方面表现优异,使用2^{11}上下文窗口训练的模型可在2^{14}窗口下保持良好性能,并实现困惑度提升。
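The allocation procedure described above can be written out for a single query position. The sketch assumes that each break point is the sigmoid of a query-key logit and that the stick is broken starting from the most recent token; both are assumptions about details the abstract leaves open, and the function name is hypothetical. It is a plain reference loop, not the numerically stable or Flash-Attention-style implementation the paper discusses.

```python
import math
from typing import List


def stick_breaking_weights(logits: List[float]) -> List[float]:
    """Attention weights for one query over its preceding tokens.

    logits[j] is the score against earlier token j, ordered oldest first.
    Starting from the most recent token, each token takes a fraction
    sigmoid(logit) of whatever stick remains.
    """
    weights = [0.0] * len(logits)
    remaining = 1.0  # length of the stick not yet allocated
    for j in range(len(logits) - 1, -1, -1):  # most recent token first
        beta = 1.0 / (1.0 + math.exp(-logits[j]))  # break point in (0, 1)
        weights[j] = beta * remaining
        remaining *= 1.0 - beta
    return weights
```

With equal logits, nearer tokens receive geometrically larger weights, which is the recency bias mentioned in the abstract. Unlike softmax, the weights need not sum to one: whatever remains of the stick is simply left unallocated.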


M-RewardBench: Evaluating Reward Models in Multilingual Settings

Abstract

arXiv:2410.15522v3 Announce Type: replace-cross Abstract: Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances for 23 typologically diverse languages, that tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RMs' performances between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs is improved with improved translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release M-RewardBench dataset and the codebase in this study to facilitate a better understanding of RM evaluation in multilingual settings.

摘要

奖励模型(RMs)通过将人类反馈整合到语言建模过程中,推动了当前大语言模型(LLMs)的最先进性能。然而,奖励模型主要针对英语进行训练和评估,其在多语言环境中的能力仍鲜有研究。本研究对多种奖励模型在多语言环境下的表现进行了系统性评估。我们首先构建了首个多语言奖励模型评估基准M-RewardBench,包含23种类型多样语言的2870个偏好实例,用于测试奖励模型在对话、安全性、推理和翻译方面的能力。随后,我们在M-RewardBench上对多种奖励模型进行了严格评估,为这些模型在不同语言中的表现提供了新的见解。研究发现,奖励模型在英语与非英语语言之间的性能存在显著差距,且其偏好会因语言不同而发生显著变化。我们还就不同多语言因素如何影响奖励模型性能提出了若干发现。具体而言,研究表明翻译质量的提升会改善奖励模型的性能。同样,我们发现模型在高资源语言中表现更优。本研究公开了M-RewardBench数据集及代码库,以促进对多语言环境下奖励模型评估的深入理解。
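Evaluation on preference instances of this kind typically reduces to checking whether the reward model scores the preferred response above the rejected one, aggregated per language. The sketch below illustrates that bookkeeping; the field names `language`, `prompt`, `chosen`, and `rejected`, as well as the `reward` callable, are hypothetical and not taken from the released dataset.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def accuracy_by_language(
    instances: Iterable[Mapping[str, str]],
    reward: Callable[[str, str], float],  # hypothetical: (prompt, response) -> score
) -> Dict[str, float]:
    """Fraction of instances per language where chosen outscores rejected."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for inst in instances:
        lang = inst["language"]
        totals[lang] += 1
        chosen_score = reward(inst["prompt"], inst["chosen"])
        rejected_score = reward(inst["prompt"], inst["rejected"])
        hits[lang] += int(chosen_score > rejected_score)
    return {lang: hits[lang] / totals[lang] for lang in totals}
```

Comparing the per-language accuracies against the English figure is one simple way to surface the kind of gap the abstract reports.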


Knowledge-Guided Prompt Learning for Request Quality Assurance in Public Code Review

Abstract

arXiv:2410.21673v2 Announce Type: replace-cross Abstract: Public Code Review (PCR) is developed in the Software Question Answering (SQA) community, assisting developers in exploring high-quality and efficient review services. Current methods on PCR mainly focus on the reviewer's perspective, including finding a capable reviewer, predicting comment quality, and recommending/generating review comments. However, how to satisfy the review necessity requests posted by developers, which can increase their visibility and in turn acts as a prerequisite for better review responses, is not well studied. To this end, we propose Knowledge-guided Prompt learning for Public Code Review (KP-PCR) to achieve developer-based code review request quality assurance (i.e., the request necessity prediction and tag recommendation subtasks). Specifically, we reformulate the two subtasks via 1) text prompt tuning, which converts both of them into a Masked Language Model (MLM) task by constructing prompt templates using hard prompts; and 2) knowledge and code prefix tuning, which introduces knowledge guidance from fine-tuned large language models by soft prompts, and uses a program dependence graph to characterize code snippets. Finally, both the request necessity prediction and tag recommendation subtasks output predicted results through an answer engineering module. In addition, we further analyze the time complexity of KP-PCR, which relies on a lightweight prefix-based operation to introduce knowledge guidance. Experimental results on the PCR dataset for the period 2011-2023 demonstrate that our KP-PCR outperforms baselines by 2.3%-8.4% in the request necessity prediction and by 1.4%-6.9% in the tag recommendation. The code implementation is released at https://github.com/WUT-IDEA/KP-PCR.

摘要

公共代码评审(PCR)技术在软件问答(SQA)社区中发展起来,旨在为开发者提供高质量、高效的评审服务。现有PCR方法主要从评审者视角出发,包括寻找合格评审者、预测评论质量以及推荐/生成评审意见。然而,对于如何满足开发者提出的评审必要性请求(这类请求能提升代码可见度,进而为获得更好评审反馈创造条件)的研究尚不充分。为此,我们提出知识引导的提示学习框架(KP-PCR),实现基于开发者的代码评审请求质量保障(包括请求必要性预测和标签推荐子任务)。具体而言,我们通过以下方式重构这两个子任务:1)文本提示调优,利用硬提示构建模板将二者转化为掩码语言模型(MLM);2)知识与代码前缀调优,通过软提示引入微调大语言模型的知识指导,并采用程序依赖图表征代码片段。最终,请求必要性预测和标签推荐子任务通过答案工程模块输出预测结果。此外,我们进一步分析了KP-PCR的时间复杂度,表明引入知识引导的操作具有轻量级前缀特性。在2011-2023年PCR数据集上的实验表明,KP-PCR在请求必要性预测任务中比基线方法提升2.3%-8.4%,在标签推荐任务中提升1.4%-6.9%。代码实现已发布于https://github.com/WUT-IDEA/KP-PCR。


Can LLMs be Good Graph Judge for Knowledge Graph Construction?

Abstract

arXiv:2411.17388v3 Announce Type: replace-cross Abstract: In real-world scenarios, most of the data obtained from the information retrieval (IR) system is unstructured. Converting natural language sentences into structured Knowledge Graphs (KGs) remains a critical challenge. We identified three limitations with respect to existing KG construction methods: (1) There could be a large amount of noise in real-world documents, which could result in extracting messy information. (2) Naive LLMs usually extract inaccurate knowledge from some domain-specific documents. (3) Hallucination phenomenon cannot be overlooked when directly using LLMs to construct KGs. In this paper, we propose GraphJudge, a KG construction framework to address the aforementioned challenges. In this framework, we designed an entity-centric strategy to eliminate the noise information in the documents. And we fine-tuned an LLM as a graph judge to finally enhance the quality of generated KGs. Experiments conducted on two general and one domain-specific text-graph pair datasets demonstrate state-of-the-art performance against various baseline methods with strong generalization abilities. Our code is available at https://github.com/hhy-huang/GraphJudge.

摘要

在现实场景中,从信息检索(IR)系统获取的数据大多是非结构化的。将自然语言句子转换为结构化的知识图谱(KG)仍然是一个关键挑战。我们总结了现有知识图谱构建方法的三个局限性:(1)真实文档可能包含大量噪声,导致提取的信息混乱;(2)未经优化的大语言模型(LLM)在处理特定领域文档时通常提取的知识不准确;(3)直接使用LLM构建知识图谱时存在的幻觉现象不可忽视。本文提出GraphJudge框架来解决上述挑战。该框架设计了以实体为中心的策略来消除文档中的噪声信息,并通过微调LLM作为图谱裁判来最终提升生成知识图谱的质量。在两个通用数据集和一个领域特定文本-图谱配对数据集上的实验表明,本方法相较于多种基线模型具有最先进的性能表现和强大的泛化能力。代码已开源于https://github.com/hhy-huang/GraphJudge。


Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models

Abstract

arXiv:2411.02448v3 Announce Type: replace-cross Abstract: LLMs have demonstrated impressive proficiency in generating coherent and high-quality text, making them valuable across a range of text-generation tasks. However, rigorous evaluation of this generated content is crucial, as ensuring its quality remains a significant challenge due to persistent issues such as factual inaccuracies and hallucination. This paper introduces three fine-tuned general-purpose LLM autoevaluators, REC-8B, REC-12B and REC-70B, specifically designed to evaluate generated text across several dimensions: faithfulness, instruction following, coherence, and completeness. These models not only provide ratings for these metrics but also offer detailed explanation and verifiable citation, thereby enhancing trust in the content. Moreover, the models support various citation modes, accommodating different requirements for latency and granularity. Extensive evaluations on diverse benchmarks demonstrate that our general-purpose LLM auto-evaluator, REC-70B, outperforms state-of-the-art LLMs, excelling in content evaluation by delivering better quality explanation and citation with minimal bias. Our REC dataset and models are available at https://github.com/adelaidehsu/REC.

摘要

大语言模型(LLMs)在生成连贯且高质量的文本方面表现出卓越能力,使其在多种文本生成任务中具有重要价值。然而,由于存在事实性错误和幻觉等持续性问题,确保生成内容质量仍面临重大挑战,因此对其进行严格评估至关重要。本文介绍了三种经过微调的通用LLM自动评估模型——REC-8B、REC-12B和REC-70B,这些模型专门设计用于从多个维度评估生成文本:忠实性、指令遵循性、连贯性和完整性。这些模型不仅提供各项指标的评分,还能生成详细解释和可验证的引用,从而增强对内容的信任度。此外,模型支持多种引用模式,可满足对延迟和粒度的不同需求。在多样化基准测试上的广泛评估表明,我们的通用LLM自动评估模型REC-70B优于当前最先进的LLMs,在内容评估方面表现卓越,能够以最小偏差提供更高质量的解释和引用。我们的REC数据集和模型已发布于https://github.com/adelaidehsu/REC。


Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts

Abstract

arXiv:2501.02009v2 Announce Type: replace-cross Abstract: Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.

摘要

理解大型语言模型(LLMs)的内部机制是一个关键的研究前沿。先前研究表明,单个LLM的概念表征可被提取为转向向量(SVs),从而实现对模型行为的控制(例如诱导生成有害内容)。本研究采用创新方法,通过探究不同LLM间概念表征的复杂关联,与柏拉图的洞穴寓言形成有趣类比。具体而言,我们提出一种线性变换方法来桥接这些表征,并呈现三个重要发现:1)不同LLM间的概念表征可通过简单线性变换实现有效对齐,使得基于SVs的跨模型迁移和行为控制成为可能;2)该线性变换具有概念泛化性,可协调不同LLM中表征各类概念的SVs;3)LLM概念表征存在弱到强的可迁移性,即从小型LLM提取的SVs能有效控制更大规模LLM的行为。


TiEBe: Tracking Language Model Recall of Notable Worldwide Events Through Time

Abstract

arXiv:2501.07482v2 Announce Type: replace-cross Abstract: As the knowledge landscape evolves and large language models (LLMs) become increasingly widespread, there is a growing need to keep these models updated with current events. While existing benchmarks assess general factual recall, few studies explore how LLMs retain knowledge over time or across different regions. To address these gaps, we present the Timely Events Benchmark (TiEBe), a dataset of over 23,000 question-answer pairs centered on notable global and regional events, spanning more than 10 years of events, 23 regions, and 13 languages. TiEBe leverages structured retrospective data from Wikipedia to identify notable events through time. These events are then used to construct a benchmark to evaluate LLMs' understanding of global and regional developments, grounded in factual evidence beyond Wikipedia itself. Our results reveal significant geographic disparities in factual recall, emphasizing the need for more balanced global representation in LLM training. We also observe a Pearson correlation of more than 0.7 between models' performance in TiEBe and various countries' socioeconomic indicators, such as HDI. In addition, we examine the impact of language on factual recall by posing questions in the native language of the region where each event occurred, uncovering substantial performance gaps for low-resource languages.

摘要

随着知识领域的演进和大语言模型(LLMs)的日益普及,如何使这些模型及时掌握时事动态的需求愈发迫切。现有基准测试主要评估通用事实召回能力,但针对LLMs跨时间或跨区域知识保留的研究仍属空白。为此,我们提出时效事件基准(TiEBe),这是一个包含23,000余个问答对的数据集,聚焦跨越10年以上、涵盖23个地区和13种语言的全球及区域重大事件。TiEBe利用维基百科的结构化回溯数据识别历史重大事件,并基于维基百科之外的事实证据构建评估基准,用以检验LLMs对全球和区域发展的理解。研究结果揭示了事实召回能力存在显著地域差异,凸显了LLM训练中全球代表性平衡的必要性。同时发现模型在TiEBe的表现与各国人类发展指数(HDI)等社会经济指标之间存在超过0.7的皮尔逊相关性。此外,我们通过采用事件发生地母语提问的方式探究语言对事实召回的影响,发现低资源语言存在显著的性能差距。


People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

Abstract

arXiv:2501.15654v2 Announce Type: replace-cross Abstract: In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ('AI vocabulary'), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.

摘要

本文研究了人类对商用大型语言模型(GPT-4o、Claude、o1)生成文本的检测能力。我们聘请标注人员阅读300篇非虚构类英文文章,将其标注为人类撰写或AI生成,并要求提供段落长度的决策解释。实验表明,经常使用LLM完成写作任务的标注者即使未经专门训练或反馈,也能出色识别AI生成文本。事实上,五位此类"专家"标注者的多数表决在300篇文章中仅误判1篇,其表现显著优于我们评估的大多数商用和开源检测器——即使在存在改写、人性化等规避策略的情况下亦然。对专家自由形式解释的定性分析显示,虽然他们高度依赖特定词汇线索("AI词汇"),但也能捕捉文本中更复杂的现象(如正式度、原创性、清晰度),这些是自动检测器难以评估的。我们公开标注数据集和代码,以推动未来关于人工与自动化检测AI生成文本的研究。


MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models

Abstract

arXiv:2502.11051v3 Announce Type: replace-cross Abstract: Recent progress in Machine Unlearning (MU) has introduced solutions for the selective removal of private or sensitive information encoded within deep neural networks. Nonetheless, MU for Multimodal Large Language Models (MLLMs) remains in its nascent phase. Therefore, we propose to reformulate the task of multimodal MU in the era of MLLMs, which aims to erase only the visual patterns associated with a given entity while preserving the corresponding textual knowledge encoded within the original parameters of the language model backbone. Furthermore, we develop a novel geometry-constrained gradient ascent method, MMUnlearner. It updates the weights of MLLMs with a weight saliency map jointly restricted by the remaining concepts and textual knowledge during unlearning, thereby preserving parameters essential for non-target knowledge. Extensive experiments demonstrate that MMUnlearner surpasses baselines that finetune MLLMs with VQA data directly through Gradient Ascent (GA) or Negative Preference Optimization (NPO), across all evaluation dimensions. Our code will be released upon acceptance.

摘要

机器学习遗忘(MU)领域的最新进展为深度神经网络中编码的私有或敏感信息的选择性删除提供了解决方案。然而,针对多模态大语言模型(MLLMs)的遗忘研究仍处于起步阶段。为此,我们提出在MLLMs时代重新定义多模态遗忘任务,其目标是在保留语言模型骨干原始参数中编码的相应文本知识的同时,仅消除与给定实体相关的视觉模式。此外,我们开发了一种新颖的几何约束梯度上升方法MMUnlearner。该方法在遗忘过程中通过剩余概念和文本知识共同约束的权重显著性图来更新MLLMs的权重,从而保留对非目标知识至关重要的参数。大量实验表明,在所有评估维度上,MMUnlearner均优于直接通过梯度上升(GA)或负偏好优化(NPO)使用视觉问答数据微调MLLMs的基线方法。我们的代码将在论文录用后公开。


Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

Abstract

arXiv:2502.02789v2 Announce Type: replace-cross Abstract: Improving time-to-first-token (TTFT) is an important objective in modern large language model (LLM) inference engines. Optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, improving TTFT is notoriously challenging since it is compute-bound and the performance bottleneck shifts from the self-attention that many prior works focus on to the MLP part. In this work, we present SpecPrefill, a training-free framework that accelerates the inference TTFT for both long and medium context queries based on the following insight: LLMs are generalized enough to preserve the quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate locally important tokens based on the context. These tokens, along with the necessary positional information, are then sent to the main model for processing. We evaluate SpecPrefill with a diverse set of tasks, followed by a comprehensive benchmarking of performance improvement both in a real end-to-end setting and ablation studies. SpecPrefill manages to serve Llama-3.1-405B-Instruct-FP8 with up to 7× maximal end-to-end QPS on real downstream tasks and 7.66× TTFT improvement.

摘要

提升首令牌生成时间(TTFT)是现代大语言模型(LLM)推理引擎的核心优化目标。直接优化TTFT可显著提高最大QPS(每秒查询数),满足诸多关键应用场景的需求。然而,由于该过程受计算资源限制且性能瓶颈从先前研究关注的自注意力机制转移至MLP部分,TTFT优化面临显著挑战。本研究提出SpecPrefill框架,该免训练方案基于以下发现实现长短上下文查询的TTFT加速:LLM具备足够泛化能力,仅需处理经精心筛选的提示词元子集即可保持输出质量。其核心在于利用轻量级模型推测上下文中的局部关键词元,这些词元与必要的位置信息共同输入主模型进行处理。我们通过多样化任务集评估SpecPrefill,并在真实端到端场景与消融实验中进行全面性能基准测试。实验表明,SpecPrefill在真实下游任务中可为Llama-3.1-405B-Instruct-FP8模型实现最高7倍的端到端QPS提升,并取得7.66倍的TTFT改进。


DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Abstract

arXiv:2502.14037v2 Announce Type: replace-cross Abstract: Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the most common strategies either consider only the most probable tokens, which reduces output diversity, or increase the likelihood of unlikely tokens, compromising output accuracy and correctness. In this paper, we propose three new decoding methods that leverage a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. Experiments concerning math problem solving, extreme summarization, and the divergent association task demonstrate that our approach consistently performs at least as well as existing methods in terms of quality and diversity.

摘要

尽管语言模型的能力不断增强,但其仍频繁复现训练数据内容、生成重复文本,并倾向于使用常见的语法模式和词汇。一个可能的原因是解码策略:最常见的策略要么仅考虑概率最高的标记(这会降低输出多样性),要么增加低概率标记的可能性(从而影响输出的准确性和正确性)。本文提出三种新的解码方法,通过数学分析标记概率分布来确保生成上下文恰当的文本。特别是,可利用连续排序概率之间的差异来截断错误标记。在数学问题求解、极端摘要和发散联想任务的实验中,我们的方法在质量和多样性方面始终表现不逊于现有方法。
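
The truncation idea in the abstract above (using the difference between consecutive, sorted probabilities) can be illustrated with a minimal sketch. The abstract does not say where exactly the cut is placed, so the rule used here, cutting right after the largest drop, is an assumption, and the function name `diff_cut` is hypothetical rather than taken from the paper.

```python
def diff_cut(probs):
    """Truncate a token distribution at the largest drop between
    consecutive probabilities sorted in descending order.

    Assumption: the cut point is the largest drop; the abstract only
    says that such differences can be used for truncation.
    """
    # Token indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    sorted_p = [probs[i] for i in order]
    if len(sorted_p) < 2:
        return {i: 1.0 for i in order}

    # Drop between each probability and the next one in sorted order.
    drops = [sorted_p[i] - sorted_p[i + 1] for i in range(len(sorted_p) - 1)]

    # Keep every token up to and including the one before the largest drop.
    cut = max(range(len(drops)), key=lambda i: drops[i]) + 1
    kept = order[:cut]

    # Renormalize the surviving probabilities so they sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


truncated = diff_cut([0.05, 0.40, 0.35, 0.15, 0.05])
```

For this toy distribution the largest drop is between 0.35 and 0.15, so only tokens 1 and 2 survive and are renormalized before sampling.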


Online Scheduling for LLM Inference with KV Cache Constraints

Abstract

arXiv:2502.07115v4 Announce Type: replace-cross Abstract: Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose a novel batching and scheduling algorithm that minimizes inference latency while effectively managing the KV cache's memory. More specifically, we make the following contributions. First, to evaluate the performance of online algorithms for scheduling in LLM inference, we introduce a hindsight optimal benchmark, formulated as an integer program that computes the minimum total inference latency under full future information. Second, we prove that no deterministic online algorithm can achieve a constant competitive ratio when the arrival process is arbitrary. Third, motivated by the computational intractability of solving the integer program at scale, we propose a polynomial-time online scheduling algorithm and show that under certain conditions it can achieve a constant competitive ratio. We also demonstrate our algorithm's strong empirical performance by comparing it to the hindsight optimal in a synthetic dataset. Finally, we conduct empirical evaluations on a real-world public LLM inference dataset, simulating the Llama2-70B model on A100 GPUs, and show that our algorithm significantly outperforms the benchmark algorithms. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.

摘要

大语言模型(LLM)推理是一个计算密集型过程,其中训练好的模型根据用户提示逐词生成文本,需要高效调度以优化延迟和资源利用率。LLM推理的一个关键挑战是键值(KV)缓存的管理,该缓存虽能减少冗余计算,但会带来内存限制。本研究从理论上建立了带KV缓存约束的LLM推理模型,并提出一种新颖的批处理与调度算法,在有效管理KV缓存内存的同时最小化推理延迟。 具体而言,我们做出以下贡献:首先,为评估LLM推理调度在线算法的性能,我们引入事后最优基准,将其表述为整数规划问题,计算在完全未来信息下的最小总推理延迟;其次,我们证明当请求到达过程任意时,任何确定性在线算法都无法达到恒定竞争比;第三,针对大规模整数规划求解的计算难题,我们提出一种多项式时间在线调度算法,并证明在特定条件下该算法可实现恒定竞争比。通过合成数据集与事后最优解的对比,我们还验证了算法的强实证性能。最后,我们在真实世界公开LLM推理数据集上进行了实证评估,模拟A100 GPU运行Llama2-70B模型,结果表明本算法显著优于基准算法。总体而言,我们的研究结果为更可持续、更具成本效益的LLM部署提供了路径。


EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models

Abstract

arXiv:2502.11916v2 Announce Type: replace-cross Abstract: Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits like coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research.

摘要

自动作文评分(AES)在教育评估中发挥着关键作用,通过对写作任务进行可扩展且一致的评估。然而,传统AES系统面临三大挑战:(1)依赖手工制作的特征,限制了泛化能力;(2)难以捕捉如连贯性和论证性等细粒度特征;(3)无法处理多模态上下文。在多模态大语言模型(MLLMs)时代,我们提出了EssayJudge,这是首个评估AES在词汇、句子和篇章层面特征能力的多模态基准。通过利用MLLMs在特定特征评分和多模态上下文理解方面的优势,EssayJudge旨在提供精确、上下文丰富的评估,而无需手动特征工程,从而解决长期存在的AES局限性。我们对18个代表性MLLMs的实验表明,与人工评估相比,AES性能存在差距,尤其是在篇章层面特征上,这凸显了基于MLLM的AES研究需要进一步推进。


Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection

Abstract

arXiv:2502.13061v2 Announce Type: replace-cross Abstract: Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While LMMs have shown promise in hateful meme detection, they face notable challenges like sub-optimal performance and limited out-of-domain generalization capabilities. Recent studies further reveal the limitations of both SFT and in-context learning when applied to LMMs in this setting. To address these issues, we propose a robust adaptation framework for hateful meme detection that enhances in-domain accuracy and cross-domain generalization while preserving the general vision-language capabilities of LMMs. Experiments on six meme classification datasets show that our approach achieves state-of-the-art performance, outperforming larger agentic systems. Moreover, our method generates higher-quality rationales for explaining hateful content compared to standard SFT, enhancing model interpretability.

摘要

仇恨表情包已成为互联网上的重要问题,亟需建立强大的自动化检测系统。尽管大型多模态模型(LMM)在仇恨表情包检测中展现出潜力,但仍面临性能欠佳和跨领域泛化能力有限等显著挑战。最新研究进一步揭示了在此场景下,无论是监督微调(SFT)还是上下文学习应用于LMM时都存在局限性。为解决这些问题,我们提出了一种鲁棒的仇恨表情包检测适配框架,该框架在保持LMM通用视觉-语言能力的同时,提升了领域内准确性和跨领域泛化能力。在六个表情包分类数据集上的实验表明,我们的方法实现了最先进的性能表现,优于规模更大的代理系统。此外,与标准监督微调相比,我们的方法能生成更高质量的解释性依据来阐明仇恨内容,从而增强了模型的可解释性。


Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs

Abstract

arXiv:2502.16901v2 Announce Type: replace-cross Abstract: We explore Cross-lingual Backdoor ATtacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare and high-occurring tokens serving as specific, effective triggers. Our findings expose a critical vulnerability that influences the model's architecture, resulting in a concealed backdoor effect during the information flow. Our code and data are publicly available at https://github.com/himanshubeniwal/X-BAT.

摘要

我们研究了多语言大语言模型(mLLMs)中的跨语言后门攻击(X-BAT),揭示了通过共享嵌入空间,植入一种语言的后门如何自动迁移至其他语言。以毒性分类为案例,我们证明攻击者仅需污染单一语言数据即可危害多语言系统,其中罕见词元和高频词元均可作为特定且有效的触发器。研究结果暴露了一个影响模型架构的关键漏洞,该漏洞会在信息流中产生隐蔽的后门效应。我们的代码和数据已公开于https://github.com/himanshubeniwal/X-BAT。


EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking

Abstract

arXiv:2502.12466v2 Announce Type: replace-cross Abstract: As large language models (LLMs) become integral to code-related tasks, a central question emerges: do LLMs truly understand program execution semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model's understanding of code execution semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. The transformations span syntactic edits, structural modifications, and algorithmic changes, covering a broad spectrum of semantic variation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning over execution semantics, highlighting fundamental limitations.

摘要

随着大语言模型(LLMs)在代码相关任务中的广泛应用,一个核心问题浮现:LLMs是否真正理解程序执行语义?我们提出EquiBench——一个通过等价性检验评估LLMs的新基准,即判断两个程序在所有可能输入下是否产生相同输出。与现有代码生成基准不同,该任务直接检验模型对代码执行语义的理解能力。EquiBench包含2400个跨四种编程语言和六大类别的程序对,这些程序对通过程序分析、编译器调度和超级优化技术生成,确保标签高置信度、难度非平凡且实现全自动化。其转换操作涵盖语法编辑、结构修改和算法变更,覆盖广泛的语义变异谱系。我们对19个前沿LLMs进行评估发现:在最具挑战性的类别中,最佳准确率仅为63.8%和76.2%,仅略高于50%的随机基线。进一步分析表明,模型往往依赖语法相似性而非对执行语义的稳健推理,这揭示了其根本性局限。
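
The equivalence-checking task described above can be made concrete with a small sketch. The three functions and the helper `find_counterexample` are hypothetical illustrations, not items from EquiBench. Random testing of this kind can only refute equivalence by finding a differing input. It cannot prove equivalence, which is why the benchmark relies on program analysis for its labels.

```python
import random


def sum_loop(n):
    """Sum 1..n with an explicit loop."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_closed_form(n):
    """Same result for n >= 0 via a different algorithm (closed form)."""
    return n * (n + 1) // 2


def sum_off_by_one(n):
    """Looks similar to sum_loop but omits n, so it is NOT equivalent."""
    return sum(range(1, n))


def find_counterexample(f, g, trials=200, seed=0):
    """Search random non-negative inputs for one where f and g disagree.

    Returns the input, or None if no disagreement was found. None is
    only evidence of equivalence, never a proof.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(0, 1000)
        if f(x) != g(x):
            return x
    return None


same = find_counterexample(sum_loop, sum_closed_form)      # no disagreement found
different = find_counterexample(sum_loop, sum_off_by_one)  # some failing input
```

The second pair is syntactically closer to `sum_loop` than the closed form is, yet it is the inequivalent one, which is the kind of case where relying on surface similarity goes wrong.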


SQLong: Enhanced NL2SQL for Longer Contexts with LLMs

Abstract

arXiv:2502.16747v2 Announce Type: replace-cross Abstract: Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, as the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios for the NL2SQL task. SQLong generates augmented datasets by extending existing database schemas with additional synthetic CREATE TABLE commands and corresponding data rows, sampled from diverse schemas in the training data. This approach effectively simulates long-context scenarios during finetuning and evaluation. Through experiments on the Spider and BIRD datasets, we demonstrate that LLMs finetuned with SQLong-augmented data significantly outperform those trained on standard datasets. These results indicate SQLong's practicality and its potential to improve NL2SQL capabilities in real-world settings with complex database schemas.

摘要

开源权重的大型语言模型(LLMs)在自然语言转SQL(NL2SQL)任务中显著提升了性能表现。然而当处理大规模数据库模式时,由于上下文长度增加,其效能会明显下降。针对这一局限,我们提出SQLong——一种新颖高效的数据增强框架,专为提升LLMs在NL2SQL长上下文场景中的性能而设计。该框架通过扩展现有数据库模式来生成增强数据集,具体方式是从训练数据的不同模式中采样,添加合成的CREATE TABLE命令及对应数据行。这种方法在微调和评估阶段有效模拟了长上下文场景。通过在Spider和BIRD数据集上的实验,我们证实使用SQLong增强数据微调的LLMs性能显著优于标准数据集训练的模型。这表明SQLong具有实际应用价值,能够有效提升现实场景中复杂数据库模式下的NL2SQL能力。


DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

Abstract

arXiv:2502.12623v2 Announce Type: replace-cross Abstract: Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to enhance modality fusion prior to input into text LLMs, tailoring DeepResonance for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We plan to open-source the models and the newly constructed datasets.

摘要

音乐大语言模型(LLMs)的最新进展显著提升了音乐理解任务的表现,这类任务涉及模型分析和解读多种音乐元素的能力。现有改进主要集中于整合音乐与文本输入,然而利用图像、视频及文本音乐特征等多模态数据增强音乐理解的潜力尚未被探索。为此,我们提出DeepResonance——一个通过多路径指令调优框架微调的多模态音乐理解大语言模型,该框架整合了音乐、文本、图像和视频的多模态对齐数据。为此,我们构建了Music4way-MI2T、Music4way-MV2T和Music4way-Any2T三个四路径训练与评估数据集,旨在使DeepResonance能够融合视觉与文本音乐特征内容。我们还引入了多重采样ImageBind嵌入和预LLM融合Transformer,以增强输入文本大语言模型前的多模态融合能力,从而优化模型的多路径指令调优性能。实验表明,我们的模型在六项音乐理解任务中均达到最先进水平,验证了辅助模态的有效性及DeepResonance的结构优势。我们将开源模型及新构建的数据集。


R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Abstract

arXiv:2502.12767v5 Announce Type: replace-cross Abstract: Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks still suffer two practical drawbacks: they must be re-tuned whenever the KG or reasoning task changes, and they depend on a single, high-capacity LLM for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from KG, which significantly enhances reliability. Experiments across five diverse benchmarks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability with reduced inference cost but increased abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning, reducing reliance on high-capacity LLMs while ensuring trustworthy inference. The code is available at https://github.com/ekrxjwh2009/R2-KG/.

摘要

近期研究将大型语言模型(LLMs)与知识图谱(KGs)相结合以增强推理能力,在无需额外训练的情况下提升推断准确性并减少幻觉现象。然而现有框架仍存在两个实际缺陷:每当知识图谱或推理任务变更时需重新调参,且依赖单一高容量LLM进行可靠(即可信)推理。为此,我们提出R2-KG——一种即插即用的双智能体框架,将推理过程分解为两个角色:负责收集证据的操作员(低容量LLM)与做出最终判定的监督员(高容量LLM)。该设计在保持强推理准确性的同时,显著降低了LLM推理成本。此外,R2-KG采用弃权机制,仅在从知识图谱收集到充分证据时生成答案,从而大幅提升可靠性。在五个多样化基准测试中的实验表明,无论操作员LLM的固有能力如何,R2-KG在准确性与可靠性方面均持续优于基线方法。进一步实验揭示,配备严格自洽策略的R2-KG单智能体版本能以更低推理成本实现显著高于基线的可靠性,但在复杂知识图谱中弃权率会升高。本研究证实R2-KG是一种灵活且高性价比的知识图谱推理解决方案,在降低对高容量LLM依赖的同时确保可信推断。代码已开源:https://github.com/ekrxjwh2009/R2-KG/。
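
The abstention idea above, answering only when the evidence is judged sufficient, can be sketched for the single-agent variant with a strict self-consistency rule. The abstract does not define "strict", so reading it as unanimous agreement across sampled answers is an assumption, and `strict_self_consistency` and `ABSTAIN` are hypothetical names.

```python
ABSTAIN = None  # hypothetical marker for "no answer given"


def strict_self_consistency(sampled_answers):
    """Return an answer only if every sampled answer is identical.

    Assumption: "strict" means unanimous agreement; any disagreement
    (or no samples at all) leads to abstention.
    """
    if not sampled_answers:
        return ABSTAIN
    if len(set(sampled_answers)) == 1:
        return sampled_answers[0]
    return ABSTAIN


confident = strict_self_consistency(["Paris", "Paris", "Paris"])
uncertain = strict_self_consistency(["Paris", "Lyon", "Paris"])
```

A rule like this trades coverage for reliability: harder questions produce more disagreement and therefore more abstentions, which matches the higher abstention rate the abstract reports for complex KGs.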


Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More

Abstract

arXiv:2503.10542v2 Announce Type: replace-cross Abstract: This work concerns the path-star task, a minimal example of searching over a graph. The graph, G, is star-shaped with D arms radiating from a start node, s. A language model (LM) is given G, s, and a target node t, which ends one of the arms, and is tasked with generating the arm containing t. The minimal nature of this task means only a single choice needs to be made: which of the D arms contains t? Decoder-only LMs fail to solve this elementary task above 1/D chance due to a learned shortcut that absorbs training supervision. We show how this pathology is caused by excess supervision and we present a series of solutions demonstrating that the task is solvable via decoder-only LMs. We find that the task's minimal nature causes its difficulty, as it prevents task decomposition. Our solutions provide insight into the pathology and its implications for LMs trained via next-token prediction.

摘要

本工作研究路径-星型任务,这是图搜索问题的一个最小化示例。该图G呈星形结构,由起始节点s向外辐射出D条臂。语言模型(LM)被给定图G、起始节点s和目标节点t(位于其中一条臂的末端),其任务是生成包含t的臂。该任务的极简特性意味着只需做出单一选择:D条臂中哪一条包含t?仅解码器架构的语言模型在这一基础任务上的表现无法超过1/D的随机概率,这是由于学习到的捷径吸收了训练监督信号。我们揭示了这种病理现象源于过度监督,并提出一系列解决方案证明该任务可通过仅解码器LM求解。研究发现,任务的最小化特性通过阻碍任务分解导致其难度。这些解决方案为理解该病理现象及其对基于下一词预测训练的LM的影响提供了启示。
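
The path-star task described above is easy to state in code. The sketch below builds one instance (a start node, D arms, a shuffled edge list, and a target at the end of one arm) and solves it by walking backward from the target. The function names, the dictionary layout, and the shuffling of edges are assumptions for illustration. The paper's actual tokenization and training setup are not reproduced.

```python
import random


def make_path_star(num_arms, arm_length, seed=0):
    """Build a star graph: `num_arms` paths of `arm_length` edges from one start node."""
    rng = random.Random(seed)
    labels = list(range(1 + num_arms * arm_length))
    rng.shuffle(labels)  # random node ids so labels carry no positional hint
    start, rest = labels[0], labels[1:]

    # Each arm is the start node followed by its own block of nodes.
    arms = [
        [start] + rest[i * arm_length:(i + 1) * arm_length]
        for i in range(num_arms)
    ]
    edges = [(arm[i], arm[i + 1]) for arm in arms for i in range(arm_length)]
    rng.shuffle(edges)  # edge order must not reveal the arms

    target_arm = rng.choice(arms)
    return {"edges": edges, "start": start,
            "target": target_arm[-1], "answer": target_arm}


def solve_backward(edges, start, target):
    """Recover the arm by following parent pointers from target to start."""
    parent = {child: par for par, child in edges}  # every non-start node has one parent
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


instance = make_path_star(num_arms=5, arm_length=4)
path = solve_backward(instance["edges"], instance["start"], instance["target"])
```

Walking backward makes every step deterministic, while generating the arm forward from the start node requires committing to one of the D arms at the very first step. That first step is the single choice the abstract refers to.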


TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation

Abstract

arXiv:2502.13442v2 Announce Type: replace-cross Abstract: Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident, yet unfounded, answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that can systematically generate an unlimited number of unanswerable math word problems and their answerable counterparts, by representing each question as a tree and removing chosen necessary conditions. Experiments show that TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 64% and 44% in their respective worst-case scenarios under the zero-shot setting. Further analysis highlights that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems. The dataset generation code and sample data are available at https://github.com/j-bagel/treecut-math.

摘要

当前大型语言模型(LLMs)在标准数学应用题基准测试(如GSM8K)上已达到接近人类的水平,但其真实推理能力仍存争议。核心问题在于模型经常对不可解问题生成自信却无依据的答案。我们提出TreeCut——一个通过将每个问题表示为树结构并移除选定必要条件,系统化生成无限不可解数学应用题及其可解对应项的合成数据集。实验表明,TreeCut能有效诱发包括GPT-4o和o3-mini在内的大型语言模型的幻觉现象,在零样本设置下的最坏场景中幻觉率分别达到64%和44%。进一步分析揭示:更深或更复杂的树结构、复合项目名称、以及在路径中部移除必要条件均会提高幻觉发生概率,这凸显了LLMs在识别不可解数学问题方面持续面临的挑战。数据集生成代码与样本数据详见https://github.com/j-bagel/treecut-math。


MirrorShield: Towards Universal Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

Abstract

arXiv:2503.12931v2 Announce Type: replace-cross Abstract: Defending large language models (LLMs) against jailbreak attacks is crucial for ensuring their safe deployment. Existing defense strategies typically rely on predefined static criteria to differentiate between harmful and benign prompts. However, such rigid rules fail to accommodate the inherent complexity and dynamic nature of real-world jailbreak attacks. In this paper, we focus on the novel challenge of universal defense against diverse jailbreaks. We propose a new concept, the "mirror", which is a dynamically generated prompt that reflects the syntactic structure of the input while ensuring semantic safety. The discrepancies between input prompts and their corresponding mirrors serve as guiding principles for defense. A novel defense model, MirrorShield, is further proposed to detect and calibrate risky inputs based on the crafted mirrors. Evaluated on multiple benchmark datasets and compared against ten state-of-the-art attack methods, MirrorShield demonstrates superior defense performance and promising generalization capabilities.

摘要

保护大语言模型(LLMs)免受越狱攻击对于确保其安全部署至关重要。现有防御策略通常依赖于预定义的静态标准来区分有害和良性提示。然而,这种僵化的规则无法适应现实世界越狱攻击固有的复杂性和动态特性。本文聚焦于针对多样化越狱攻击的通用防御这一新挑战,提出新概念“镜像”——即动态生成的提示,该提示能反映输入的句法结构同时确保语义安全性。输入提示与其对应镜像之间的差异可作为防御的指导原则。进一步提出的新型防御模型MirrorShield基于构建的镜像实现风险输入的检测与校准。通过在多个基准数据集上的评估以及与十种最先进攻击方法的对比实验,MirrorShield展现出卓越的防御性能和良好的泛化能力。


RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs

Abstract

arXiv:2503.10657v2 Announce Type: replace-cross Abstract: Routing large language models (LLMs) is a new paradigm that uses a router to recommend the best LLM from a pool of candidates for a given input. In this paper, our comprehensive analysis with more than 8,500 LLMs reveals a novel model-level scaling up phenomenon in Routing LLMs, i.e., a capable router can significantly enhance the performance of this paradigm as the number of candidates increases. This improvement can even surpass the performance of the best single model in the pool and many existing strong LLMs, confirming that it is a highly promising paradigm. However, the lack of comprehensive and open-source benchmarks for Routing LLMs has hindered the development of routers. In this paper, we introduce RouterEval, a benchmark tailored for router research, which includes over 200,000,000 performance records for 12 popular LLM evaluations across various areas such as commonsense reasoning and semantic understanding, based on over 8,500 various LLMs. Using RouterEval, extensive evaluations of existing Routing LLM methods reveal that most still have significant room for improvement. See https://github.com/MilkThink-Lab/RouterEval for all data, code and tutorials.

摘要

大型语言模型(LLMs)路由是一种新兴范式,其通过路由器从候选模型池中为给定输入推荐最优LLM。本文基于对8,500余个LLMs的综合分析,首次揭示了路由LLMs中存在模型级规模扩展现象:随着候选模型数量增加,高性能路由器能显著提升该范式的表现。这种提升甚至可超越候选池中最佳单模型及诸多现有强LLM的性能,证实其为极具前景的研究方向。然而,当前缺乏全面开源的路由LLMs基准制约了路由器的发展。为此,我们提出专用于路由器研究的基准测试RouterEval,其基于8,500多个多样化LLM,涵盖常识推理、语义理解等12个主流评测领域的逾200,000,000条性能记录。通过RouterEval对现有路由LLM方法的广泛评估表明,大多数方法仍存在显著改进空间。所有数据、代码及教程详见https://github.com/MilkThink-Lab/RouterEval。
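
A toy calculation shows why routing can beat the best single model in a pool. The sketch below compares the best single model with an oracle router that always picks a correct model when one exists. The record layout and model names are hypothetical, and the oracle is only an upper bound that a learned router would approximate, not a method from RouterEval.

```python
def best_single_accuracy(records):
    """Accuracy of the single model with the highest mean score."""
    models = records[0].keys()
    return max(sum(r[m] for r in records) / len(records) for m in models)


def oracle_router_accuracy(records):
    """Accuracy when each input is routed to its best-scoring model."""
    return sum(max(r.values()) for r in records) / len(records)


# One dict per input: 1 if that (hypothetical) model answered correctly, else 0.
records = [
    {"model_a": 1, "model_b": 0, "model_c": 0},
    {"model_a": 0, "model_b": 1, "model_c": 0},
    {"model_a": 1, "model_b": 1, "model_c": 0},
    {"model_a": 0, "model_b": 0, "model_c": 1},
]

single = best_single_accuracy(records)    # 0.5
routed = oracle_router_accuracy(records)  # 1.0
```

Even `model_c`, the weakest model on average, raises the routed score because it covers an input the others miss. This is the intuition behind performance improving as the candidate pool grows.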


Cost-Optimal Grouped-Query Attention for Long-Context Modeling

Abstract

arXiv:2503.09579v2 Announce Type: replace-cross Abstract: Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models (LLMs). However, current GQA configurations are often suboptimal because they overlook how context length influences inference cost. Since inference cost grows with context length, the most cost-efficient GQA configuration should also vary accordingly. In this work, we analyze the relationship among context length, model size, GQA configuration, and model loss, and introduce two innovations: (1) we decouple the total head size from the hidden size, enabling more flexible control over attention FLOPs; and (2) we jointly optimize the model size and the GQA configuration to arrive at a better allocation of inference resources between attention layers and other components. Our analysis reveals that commonly used GQA configurations are highly suboptimal for long-context scenarios. More importantly, we propose a recipe for deriving cost-optimal GQA configurations. Our results show that for long-context scenarios, one should use fewer attention heads while scaling up model size. Configurations selected by our recipe can reduce both memory usage and FLOPs by more than 50% compared to Llama-3's GQA, with no degradation in model capabilities. Our findings offer valuable insights for designing efficient long-context LLMs. The code is available at https://www.github.com/THUNLP/cost-optimal-gqa .

摘要

分组查询注意力(GQA)是降低大语言模型(LLM)注意力层计算成本的常用策略。然而,现有GQA配置通常并非最优,因其忽视了上下文长度对推理成本的影响。鉴于推理成本随上下文长度增长,最高效的GQA配置也应相应调整。本研究通过分析上下文长度、模型规模、GQA配置与模型损失之间的关系,提出两项创新:(1)将总头尺寸与隐藏尺寸解耦,实现对注意力浮点运算的更灵活控制;(2)联合优化模型规模与GQA配置,从而在注意力层与其他组件间实现更优的推理资源分配。分析表明,常用GQA配置在长上下文场景中效率显著不足。更重要的是,我们提出了推导成本最优GQA配置的方案。实验结果显示,针对长上下文场景,应减少注意力头数量并扩大模型规模。相比Llama-3的GQA配置,本方案选取的配置可降低超过50%的内存占用与浮点运算量,且模型性能无任何衰减。这些发现为设计高效的长上下文大语言模型提供了重要参考。代码已开源:https://www.github.com/THUNLP/cost-optimal-gqa。
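
The link between context length, the number of key-value heads, and memory can be checked with the standard KV cache size calculation. The configuration below is a hypothetical example chosen for round numbers, not one of the paper's recipes, and the sketch covers only cache memory, not attention FLOPs or model loss.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_value=2):
    """Size of the KV cache for one sequence.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value


GIB = 2 ** 30

# Hypothetical configuration: 32 layers, head dimension 128, 128K-token context.
with_8_kv_heads = kv_cache_bytes(32, 8, 128, 131072) / GIB  # 16.0 GiB
with_4_kv_heads = kv_cache_bytes(32, 4, 128, 131072) / GIB  # 8.0 GiB
```

Because the cache grows linearly with both context length and the number of KV heads, halving the KV heads halves the cache at any context length. The absolute saving therefore grows with context, which is consistent with the abstract's point that the cost-optimal configuration depends on context length.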


CRCE: Coreference-Retention Concept Erasure in Text-to-Image Diffusion Models

Abstract

arXiv:2503.14232v2 Announce Type: replace-cross Abstract: Text-to-Image diffusion models can produce undesirable content that necessitates concept erasure. However, existing methods struggle with under-erasure, leaving residual traces of targeted concepts, or over-erasure, mistakenly eliminating unrelated but visually similar concepts. To address these limitations, we introduce CRCE, a novel concept erasure framework that leverages Large Language Models to identify both semantically related concepts that should be erased alongside the target and distinct concepts that should be preserved. By explicitly modelling coreferential and retained concepts semantically, CRCE enables more precise concept removal, without unintended erasure. Experiments demonstrate that CRCE outperforms existing methods on diverse erasure tasks, including real-world objects, person identities, and abstract intellectual property characteristics. The constructed dataset CorefConcept and the source code will be released upon acceptance.

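Summary

Text-to-image diffusion models can generate undesirable content that calls for concept erasure. Existing methods, however, suffer from under-erasure (residual traces of the target concept remain) or over-erasure (unrelated but visually similar concepts are removed by mistake). To address these limitations, we propose CRCE, a new concept erasure framework that uses large language models to identify both the semantically related concepts that should be erased together with the target and the distinct concepts that should be preserved. By explicitly modeling coreferential and retained concepts at the semantic level, CRCE removes concepts more precisely and avoids unintended erasure. Experiments show that CRCE outperforms existing methods on diverse erasure tasks, including real-world objects, person identities, and abstract intellectual property characteristics. The constructed dataset, CorefConcept, and the source code will be released upon acceptance.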


LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Abstract

arXiv:2503.13794v2 Announce Type: replace-cross Abstract: Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD) via synthetic training data, yet the hand-crafted pipelines often introduce bias and overfit to specific prompts. We sidestep this issue by directly fusing hidden states from Large Language Models (LLMs) into detectors, an avenue that is surprisingly under-explored. This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge fusion from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED lifts GroundingDINO by 3.82% on OmniLabel at just 8.7% extra GFLOPs, and a larger vision backbone pushes the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales and fusion depths further corroborate our design.

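Summary

Foundation models trained on large-scale vision-language data can improve Open-Vocabulary Object Detection (OVD) through synthetic training data, but hand-crafted pipelines often introduce bias and overfit to specific prompts. We avoid this problem by fusing hidden states from large language models (LLMs) directly into detectors, an approach that has so far received little attention. This paper presents a systematic method that uses the decoder layers of the LLM inside a multimodal LLM (MLLM) to strengthen visual grounding. We introduce a zero-initialized cross-attention adapter that enables efficient knowledge fusion from LLMs to object detectors; the approach is called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics, and that adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED improves GroundingDINO by 3.82% on OmniLabel at only 8.7% extra GFLOPs, and a larger vision backbone raises the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales, and fusion depths further support the design.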
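
The point of zero initialization can be shown with a minimal sketch: if the adapter's output projection starts at zero, the fused feature equals the original detector feature, so training begins from the unmodified detector. The code below is a simplified illustration under assumptions made here (single head, no learned query/key/value projections, detector and LLM features of equal width); the abstract does not specify the adapter's actual layout, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def adapter(det_feature, llm_states, out_proj):
    """Residual fusion: detector feature plus projected attention over LLM states."""
    attended = cross_attention(det_feature, llm_states, llm_states)
    update = matvec(out_proj, attended)
    return [f + u for f, u in zip(det_feature, update)]


det_feature = [1.0, 2.0]                    # hypothetical detector feature
llm_states = [[0.5, 0.1], [0.3, 0.9]]       # hypothetical LLM hidden states
zero_proj = [[0.0, 0.0], [0.0, 0.0]]        # zero-initialized output projection

# At initialization the adapter is an identity map on the detector feature.
print(adapter(det_feature, llm_states, zero_proj))
```

As the projection weights move away from zero during training, information from the LLM states is blended in gradually rather than perturbing the pretrained detector from the first step.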


S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

Abstract

arXiv:2504.10368v2 Announce Type: replace-cross Abstract: We introduce S1-Bench, a novel benchmark designed to evaluate the performance of Large Reasoning Models (LRMs) on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their heavy reliance on system 2 thinking may limit their system 1 thinking capabilities. However, there is a lack of an appropriate benchmark for evaluating LRMs' system 1 thinking capabilities. To fill this gap, S1-Bench introduces a suite of simple, diverse, and natural questions across multiple domains and languages, specifically designed to assess LRMs' performance on questions more suitable for system 1. We conduct extensive evaluations across 28 LRMs, revealing their inefficiency, inadequate accuracy, and limited robustness when handling simple questions. Additionally, we observe a gap between their difficulty perception and generation length. Overall, this work paves the way toward dual-system compatibility in the development of LRMs.

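Summary

We introduce S1-Bench, a new benchmark for evaluating how Large Reasoning Models (LRMs) perform on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. Although LRMs have achieved major breakthroughs on complex reasoning tasks through explicit chains of thought, their heavy reliance on system 2 thinking may limit their system 1 capabilities. An appropriate benchmark for evaluating these capabilities has been lacking. S1-Bench therefore provides a set of simple, diverse, and natural questions spanning multiple domains and languages, designed specifically to assess LRMs on questions better suited to system 1. We evaluate 28 LRMs extensively and find inefficiency, inadequate accuracy, and limited robustness on simple questions. We also observe a gap between the models' perception of question difficulty and the length of their generations. Overall, this work paves the way toward dual-system compatibility in LRM development.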


Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

Abstract

arXiv:2504.08399v2 Announce Type: replace-cross Abstract: Self-report questionnaires have long been used to assess LLM personality traits, yet they fail to capture behavioral nuances due to biases and meta-knowledge contamination. This paper proposes a novel multi-observer framework for personality trait assessments in LLM agents that draws on informant-report methods in psychology. Instead of relying on self-assessments, we employ multiple observer agents. Each observer is configured with a specific relational context (e.g., family member, friend, or coworker) and engages the subject LLM in dialogue before evaluating its behavior across the Big Five dimensions. We show that these observer-report ratings align more closely with human judgments than traditional self-reports and reveal systematic biases in LLM self-assessments. We also found that aggregating responses from 5 to 7 observers reduces systematic biases and achieves optimal reliability. Our results highlight the role of relationship context in perceiving personality and demonstrate that a multi-observer paradigm offers a more reliable, context-sensitive approach to evaluating LLM personality traits.

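Summary

Self-report questionnaires have long been used to assess the personality traits of large language models (LLMs), but biases and meta-knowledge contamination prevent them from capturing behavioral nuances. This paper proposes a novel multi-observer framework for personality assessment of LLM agents, drawing on informant-report methods from psychology. Rather than relying on self-assessment, the framework employs multiple observer agents. Each observer is configured with a specific relational context (for example, family member, friend, or coworker) and converses with the subject LLM before rating its behavior on the Big Five dimensions. We show that these observer-report ratings agree more closely with human judgments than traditional self-reports do, and that they reveal systematic biases in LLM self-assessments. We also find that aggregating responses from 5 to 7 observers reduces systematic bias and achieves optimal reliability. The results highlight the role of relational context in personality perception and show that a multi-observer paradigm offers a more reliable, context-sensitive way to evaluate LLM personality traits.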
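
The aggregation step can be sketched as follows. The abstract does not say how observer ratings are combined, so the simple per-trait mean used here is an assumption, as are the 1-to-5 rating scale, the example scores, and the function name.

```python
from statistics import mean

TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def aggregate_observers(observer_ratings):
    """Average each Big Five trait over all observers (assumed aggregation rule)."""
    return {trait: mean(r[trait] for r in observer_ratings) for trait in TRAITS}


# Hypothetical ratings on a 1-5 scale from observers in different relational contexts.
ratings = {
    "family member": dict(openness=4, conscientiousness=5, extraversion=3,
                          agreeableness=5, neuroticism=2),
    "friend": dict(openness=5, conscientiousness=4, extraversion=4,
                   agreeableness=4, neuroticism=2),
    "coworker": dict(openness=3, conscientiousness=5, extraversion=2,
                     agreeableness=4, neuroticism=3),
}

profile = aggregate_observers(list(ratings.values()))
for trait, score in profile.items():
    print(f"{trait}: {score:.2f}")
```

Averaging over observers with different relational contexts is one way that an individual observer's idiosyncratic bias could be diluted, which is consistent with the reported benefit of pooling 5 to 7 observers.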


Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding

Abstract

arXiv:2504.01281v3 Announce Type: replace-cross Abstract: We present a comprehensive framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. This approach significantly improves large language models on knowledge-intensive tasks, including open-domain question answering and complex reasoning. Our framework integrates two complementary techniques: Policy-Optimized Retrieval-Augmented Generation (PORAG), which optimizes the use of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS), which dynamically determines retrieval timing and content based on contextual needs. Together, these techniques enhance both the utilization and relevance of retrieved content, improving factual accuracy and response quality. Designed as a lightweight solution compatible with any Transformer-based LLM without requiring additional training, our framework excels in knowledge-intensive tasks, boosting output accuracy in RAG settings. We further propose CRITIC, a novel method to selectively compress key-value caches by token importance, mitigating memory bottlenecks in long-context applications. The framework also incorporates test-time scaling techniques to dynamically balance reasoning depth and computational resources, alongside optimized decoding strategies for faster inference. Experiments on benchmark datasets show that our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems. This integrated approach advances the development of robust, efficient, and scalable RAG systems across diverse applications.

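Summary

We present a comprehensive framework that enhances Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. The approach significantly improves large language models on knowledge-intensive tasks, including open-domain question answering and complex reasoning. The framework integrates two complementary techniques: Policy-Optimized Retrieval-Augmented Generation (PORAG), which optimizes the use of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS), which dynamically determines when and what to retrieve based on contextual needs. Together, these techniques improve both the utilization and the relevance of retrieved content, raising factual accuracy and response quality. Designed as a lightweight solution that is compatible with any Transformer-based LLM without additional training, the framework performs well on knowledge-intensive tasks and improves output accuracy in RAG settings. We further propose CRITIC, a method that selectively compresses key-value caches according to token importance, easing memory bottlenecks in long-context applications. The framework also incorporates test-time scaling techniques to balance reasoning depth against computational resources dynamically, along with optimized decoding strategies for faster inference. Experiments on benchmark datasets show that, compared with traditional RAG systems, the framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant gains in efficiency and scalability. This integrated approach advances the development of robust, efficient, and scalable RAG systems across diverse applications.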
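
The idea of compressing a key-value cache by token importance can be illustrated with a generic top-k selection. This is not CRITIC itself: the abstract does not describe how importance is scored or how many entries are kept, so the scores, the `keep` budget, and the function name below are hypothetical.

```python
def compress_kv_cache(keys, values, importance, keep):
    """Keep the `keep` most important cache entries, preserving token order."""
    if not 0 <= keep <= len(importance):
        raise ValueError("keep must be between 0 and the number of cached tokens")
    # Rank positions by importance, highest first.
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    # Restore the original order so positional structure is not scrambled.
    kept = sorted(ranked[:keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Hypothetical cache of four tokens with made-up importance scores.
keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]
importance = [0.10, 0.70, 0.05, 0.40]

small_keys, small_values, kept = compress_kv_cache(keys, values, importance, keep=2)
print(kept, small_keys, small_values)
```

Halving the number of retained entries halves the cache memory for that layer; the trade-off is that any token dropped here can no longer be attended to in later decoding steps.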


Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Abstract

arXiv:2504.14150v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.

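Summary

Large language models (LLMs) can generate plausible explanations of how they arrived at an answer to a question. These explanations can, however, misrepresent the model's "reasoning" process; that is, they can be unfaithful. This in turn can lead to over-trust and misuse. We propose a new approach for measuring the faithfulness of LLM explanations. First, we give a rigorous definition of faithfulness. Because LLM explanations mimic human explanations, they often refer to high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that the explanations imply are influential and the set of concepts that truly are. Second, we present a novel method for estimating faithfulness, based on (1) using an auxiliary LLM to modify the values of concepts in model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example level and the dataset level. Our experiments show that the method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations make misleading claims about which pieces of evidence influenced the model's decisions.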
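
The set-based view of faithfulness can be made concrete with a small sketch. It only illustrates the comparison of two concept sets; it does not reproduce the paper's estimation procedure (counterfactual generation and the Bayesian hierarchical model). The concept names and the function name are hypothetical.

```python
def faithfulness_gap(claimed, influential):
    """Compare concepts an explanation cites with concepts that actually matter."""
    return {
        # Cited in the explanation, but with no real effect on the answer.
        "claimed_but_not_influential": claimed - influential,
        # Affects the answer, but never mentioned in the explanation.
        "influential_but_omitted": influential - claimed,
    }


# Hypothetical example in the spirit of the social bias task.
claimed = {"qualifications", "experience"}
influential = {"qualifications", "gender"}

gap = faithfulness_gap(claimed, influential)
print(gap)
```

In this toy case the second set is the more worrying one: a concept that changes the answer but is absent from the explanation corresponds to the hidden-bias pattern described in the abstract.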


LLM-hRIC: LLM-empowered Hierarchical RAN Intelligent Control for O-RAN

Abstract

arXiv:2504.18062v2 Announce Type: replace-cross Abstract: Despite recent advances in applying large language models (LLMs) and machine learning (ML) techniques to open radio access network (O-RAN), critical challenges remain, such as insufficient cooperation between radio access network (RAN) intelligent controllers (RICs), high computational demands hindering real-time decisions, and the lack of domain-specific finetuning. Therefore, this article introduces the LLM-empowered hierarchical RIC (LLM-hRIC) framework to improve the collaboration between RICs in O-RAN. The LLM-empowered non-real-time RIC (non-RT RIC) acts as a guide, offering strategic guidance to the near-real-time RIC (near-RT RIC) using global network information. The RL-empowered near-RT RIC acts as an implementer, combining this guidance with local real-time data to make near-RT decisions. We evaluate the feasibility and performance of the LLM-hRIC framework in an integrated access and backhaul (IAB) network setting, and finally, discuss the open challenges of the LLM-hRIC framework for O-RAN.

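Summary

Despite recent progress in applying large language models (LLMs) and machine learning (ML) techniques to the open radio access network (O-RAN), key challenges remain: insufficient cooperation between RAN intelligent controllers (RICs), high computational demands that hinder real-time decisions, and a lack of domain-specific fine-tuning. This article therefore proposes the LLM-empowered hierarchical RIC (LLM-hRIC) framework to improve collaboration between RICs in O-RAN. The LLM-empowered non-real-time RIC (non-RT RIC) acts as a guide, using global network information to give strategic guidance to the near-real-time RIC (near-RT RIC). The near-RT RIC, empowered by reinforcement learning (RL), acts as an implementer, combining that guidance with local real-time data to make near-real-time decisions. We evaluate the feasibility and performance of LLM-hRIC in an integrated access and backhaul (IAB) network setting, and finally discuss the open challenges the framework faces in O-RAN.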


Adaptive Thinking via Mode Policy Optimization for Social Language Agents

Abstract

arXiv:2505.02156v3 Announce Type: replace-cross Abstract: Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current studies. Existing methods either lack this kind of reasoning capability or enforce Long Chain-of-Thought reasoning uniformly across all scenarios, resulting in excessive token usage and inflexible social simulation. To address this, we propose an Adaptive Mode Learning (AML) framework in this paper, aiming to improve the adaptive thinking ability of language agents in dynamic social interactions. To this end, we first identify hierarchical thinking modes ranging from intuitive response to deep deliberation based on the cognitive control theory. We then develop the Adaptive Mode Policy Optimization (AMPO) algorithm to optimize the context-aware mode switching and reasoning. Our framework advances existing research in three key aspects: (1) Multi-granular thinking mode design, (2) Context-aware mode switching across social interaction, and (3) Token-efficient reasoning via depth-adaptive processing. Extensive experiments on social intelligence benchmarks verify that AML achieves 15.6% higher task performance than GPT-4o. Notably, our AMPO outperforms GRPO by 7.0% with 32.8% shorter reasoning chains, demonstrating the advantage of adaptive thinking mode selection and optimization mechanism in AMPO over GRPO's fixed-depth solution.

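Summary

Effective simulation of social intelligence requires language agents to adjust their reasoning depth dynamically, a key capability that is notably absent from current studies. Existing methods either lack this kind of reasoning capability or enforce long chain-of-thought reasoning in every scenario, which leads to excessive token usage and inflexible social simulation. To address this, we propose the Adaptive Mode Learning (AML) framework, which aims to improve the adaptive thinking ability of language agents in dynamic social interactions. We first establish hierarchical thinking modes, ranging from intuitive response to deep deliberation, based on cognitive control theory. We then develop the Adaptive Mode Policy Optimization (AMPO) algorithm to optimize context-aware mode switching and reasoning. The framework advances existing research in three respects: (1) multi-granular thinking mode design; (2) context-aware mode switching across social interactions; and (3) token-efficient reasoning through depth-adaptive processing. Extensive experiments on social intelligence benchmarks show that AML achieves 15.6% higher task performance than GPT-4o. Notably, AMPO outperforms GRPO by 7.0% with reasoning chains that are 32.8% shorter, which demonstrates the advantage of AMPO's adaptive thinking mode selection and optimization mechanism over GRPO's fixed-depth solution.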